# community-help
k
👋 Hi everyone! I don't know if this is the right place, but I have just discovered typesense and am trying to get it to work. Would anyone know where to look? I have described a particular problem on Stack Overflow. Thanks!
j
@Kevin Donovan I don't see any errors in the logs you posted. That's just the scraper doing its thing and indexing docs in Typesense. You want to wait for it to fully complete before trying to search.
k
Thanks Jason.
You wouldn't have any comments on the env and config files, would you? Thanks again.
j
The env and config files look fine to me
k
Then would you have any idea why the search box on our Docusaurus site just hangs there? Thanks!
Actually, I figured this out. The `typesenseCollectionName` parameter was not set. Once that parameter is set, the search box no longer hangs.
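For reference, a minimal sketch of the relevant `themeConfig` block in `docusaurus.config.js`, assuming the `docusaurus-theme-search-typesense` theme is being used; the collection name, host, port, and API key below are placeholders:
```js
// docusaurus.config.js (sketch; values are placeholders)
module.exports = {
  themes: ['docusaurus-theme-search-typesense'],
  themeConfig: {
    typesense: {
      // Must match the collection name the docsearch scraper writes to;
      // if this is missing, the search box can hang with no results
      typesenseCollectionName: 'my-docs',
      typesenseServerConfig: {
        nodes: [
          {
            host: 'localhost',
            port: 8108,
            protocol: 'http',
          },
        ],
        apiKey: 'SEARCH_ONLY_API_KEY',
      },
      // Optional extra Typesense search parameters
      typesenseSearchParameters: {},
    },
  },
};
```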
👍 1
There is a new problem: how do I run the docsearch scraper against a URL that contains a port number? Is there a setting called `allowed_domains` that can be modified to accept URLs with ports?
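For context, a sketch of what a typesense-docsearch-scraper config with a port in the URL might look like; the index name, URLs, and selectors are placeholders, and the dev site is assumed to run on port 3000:
```json
{
  "index_name": "my-docs",
  "start_urls": ["http://host.docker.internal:3000/docs/"],
  "sitemap_urls": ["http://host.docker.internal:3000/sitemap.xml"],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p, li"
  }
}
```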
j
The scraper should work with port numbers already…
k
It seems that it works by default only with ports 80 and 443. Port 80 has to be forwarded to port 3000 in order for docsearch to scrape port 3000. There is a discussion of this on GitHub here: https://github.com/typesense/typesense/issues/628#issuecomment-1174040085.
The heart of the problem is that docsearch 'finds' the different web pages in sitemap.xml, but when it attempts to crawl them it substitutes the localhost IP address with the organization's URL. It cannot find the Docusaurus pages under that URL, for the simple reason that they are not there! If this substitution could be prevented or compensated for, then I could run docsearch against my development site.
j
Ah, my bad, I didn't realize this was a limitation of Scrapy, which is the underlying library the docsearch-scraper uses...
k
Thanks. How did you know that this was a limitation of Scrapy?
And would you know of a workaround? Kind regards
j
I figured that out from the stacktrace posted in the issue:
WARNING:py.warnings:/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/
site-packages/scrapy/spidermiddlewares/offsite.py:69: PortWarning: allowed_domains 
accepts only domains without ports. Ignoring entry host.docker.internal:3000 in 
allowed_domains.  warnings.warn(message, PortWarning)
The file that's emitting that warning is inside the scrapy package. I also searched their source code for that error message and confirmed that it comes from within Scrapy. Re: workaround, the iptables-based approach mentioned in the GitHub issue you shared seems to work.
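The general shape of that kind of iptables redirect is something like the following; this is a sketch rather than the verbatim recipe from the issue, and it assumes the dev site listens on port 3000 on the Docker host:
```sh
# Redirect incoming traffic on port 80 to port 3000, so the scraper can
# reach the dev site without putting a port number in allowed_domains
sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 3000
```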
k
Redirecting the port helped resolve one problem, but unfortunately scrapy 'converted' the IP address that was passed in Docker to the organization's URL. A new set of error messages appeared.
DEBUG:scrapy.core.engine:Crawled (404) <GET https://www.algotrader.com/docs/virtual_spot_positions> (referer: http://host.docker.internal:3000/sitemap.xml)
Scrapy can read the sitemap.xml file via the redirected port, since it was able to detect `virtual_spot_positions`, which is the name of a page in the Docusaurus site. Unfortunately, it substitutes the IP address implied by `host.docker.internal` with the organization's URL, and of course it can't find `virtual_spot_positions` there. Arrgh. Redirecting the port does solve one problem, but in this case it led to another. I will attempt to duplicate the environment described in the GitHub posting. In the setup where I encountered this problem, the Docusaurus site was running on Windows, whereas Typesense was running in Docker on WSL Ubuntu on the same physical machine. In the configuration where port redirection succeeded, the Docusaurus site was running on Ubuntu. This might explain why it works in one situation but not another.
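One possible explanation, offered as an assumption rather than a confirmed diagnosis: Docusaurus's sitemap plugin builds absolute page URLs from the `url` and `baseUrl` fields in `docusaurus.config.js`, so even a dev build served from localhost:3000 can emit production URLs in sitemap.xml, which would send the crawler to `https://www.algotrader.com/...` regardless of which host served the sitemap:
```js
// docusaurus.config.js (sketch) — the sitemap plugin combines these fields
// into absolute page URLs, even when the site is served from localhost:3000
module.exports = {
  url: 'https://www.algotrader.com',
  baseUrl: '/docs/',
  // ...
};
```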
j
Another shotgun approach could be to run something like ngrok to create a tunnel endpoint for your localhost site, so it's accessible via https on port 443, and then point the docsearch-scraper at that ngrok tunnel endpoint.
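A sketch of what that would look like, assuming the dev site runs on port 3000; the tunnel hostname below is a placeholder generated by ngrok:
```sh
# Open an HTTPS tunnel to the local dev site
ngrok http 3000

# Then point the scraper config's start_urls and sitemap_urls at the
# generated tunnel URL, e.g. https://<random-subdomain>.ngrok.io/docs/
```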