# community-help
k
👋 Hi everyone! I don't know if this is the right place, but I have just discovered typesense and am trying to get it to work. Would anyone know where to look? I have described a particular problem on Stack Overflow. Thanks!
j
@Kevin Donovan I don't see any errors in the logs you posted. That's just the scraper doing its thing and indexing docs in Typesense. You want to wait for it to fully complete before trying to search.
k
Thanks Jason.
You wouldn't have any comments on the env and config files, would you? Thanks again.
j
The env and config files look fine to me
k
Then would you have any idea why the search box on our Docusaurus site just hangs there? Thanks!
Actually, I figured this out. The `typesenseCollectionName` parameter was not set. Once that parameter is set, the search box no longer hangs.
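For reference, a minimal sketch of the relevant `themeConfig` block in `docusaurus.config.js`, assuming the `docusaurus-theme-search-typesense` theme is being used; the collection name, host, port, and API key below are placeholders:
```js
// docusaurus.config.js (sketch; values are placeholders)
module.exports = {
  themes: ['docusaurus-theme-search-typesense'],
  themeConfig: {
    typesense: {
      // Must match the collection name the docsearch scraper writes to;
      // if this is missing, the search box can hang with no results
      typesenseCollectionName: 'my-docs',
      typesenseServerConfig: {
        nodes: [
          {
            host: 'localhost',
            port: 8108,
            protocol: 'http',
          },
        ],
        apiKey: 'SEARCH_ONLY_API_KEY',
      },
      // Optional extra Typesense search parameters
      typesenseSearchParameters: {},
    },
  },
};
```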
👍 1
There is a new problem: how do I run the docsearch scraper against a URL that contains a port number? Is there a setting called `allowed_domains` that can be modified to accept URLs with ports?
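For context, a sketch of what a typesense-docsearch-scraper config with a port in the URL might look like; the index name, URLs, and selectors are placeholders, and the dev site is assumed to run on port 3000:
```json
{
  "index_name": "my-docs",
  "start_urls": ["http://host.docker.internal:3000/docs/"],
  "sitemap_urls": ["http://host.docker.internal:3000/sitemap.xml"],
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "text": "p, li"
  }
}
```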
j
The scraper should work with port numbers already…
k
It seems that it works by default only with ports 80 and 443. Port 80 has to be forwarded to port 3000 in order for docsearch to scrape port 3000. There is a discussion of this on GitHub here: https://github.com/typesense/typesense/issues/628#issuecomment-1174040085.
The heart of the problem is that docsearch 'finds' the different web pages in sitemap.xml, but when it attempts to crawl them it substitutes the localhost IP address with the organization's URL. It cannot find the Docusaurus pages under that URL, for the simple reason that they are not there! If this substitution could be prevented or compensated for, then I could run docsearch against my development site.
j
Ah, my bad, I didn't realize this was a limitation of Scrapy, which is the underlying library the docsearch-scraper uses...
k
Thanks. How did you know that this was a limitation of Scrapy?
And would you know of a workaround? Kind regards
j
I figured that out from the stacktrace posted in the issue:
WARNING:py.warnings:/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.6/
site-packages/scrapy/spidermiddlewares/offsite.py:69: PortWarning: allowed_domains 
accepts only domains without ports. Ignoring entry host.docker.internal:3000 in 
allowed_domains.  warnings.warn(message, PortWarning)
The file that's emitting that warning is inside the scrapy package. I also searched their source code for that error message and confirmed that it comes from within Scrapy. Re: workaround, the iptables-based approach mentioned in the GitHub issue you shared seems to work.
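The general shape of that kind of iptables redirect is something like the following; this is a sketch rather than the verbatim recipe from the issue, and it assumes the dev site listens on port 3000 on the Docker host:
```sh
# Redirect incoming traffic on port 80 to port 3000, so the scraper can
# reach the dev site without putting a port number in allowed_domains
sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 3000
```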
k
Redirecting the port helped resolve one problem, but unfortunately scrapy 'converted' the IP address that was passed in Docker to the organization's URL. A new set of error messages appeared.
DEBUG:scrapy.core.engine:Crawled (404) <GET https://www.algotrader.com/docs/virtual_spot_positions> (referer: http://host.docker.internal:3000/sitemap.xml)
Scrapy can read the sitemap.xml file via the redirected port, since it was able to detect `virtual_spot_positions`, which is the name of a page in the Docusaurus site. Unfortunately, it substitutes the IP address implied by `host.docker.internal` with the organization's URL, and of course it can't find `virtual_spot_positions` there. Arrgh. Redirecting the port does solve one problem, but in this case it led to another. I will attempt to duplicate the environment described in the GitHub posting. In the setup where I encountered this problem, the Docusaurus site was running on Windows, whereas Typesense was running in Docker on WSL Ubuntu on the same physical machine. In the configuration where port redirection succeeded, the Docusaurus site was running on Ubuntu. This might explain why it works in one situation but not another.
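One possible explanation, offered as an assumption rather than a confirmed diagnosis: Docusaurus's sitemap plugin builds absolute page URLs from the `url` and `baseUrl` fields in `docusaurus.config.js`, so even a dev build served from localhost:3000 can emit production URLs in sitemap.xml, which would send the crawler to `https://www.algotrader.com/...` regardless of which host served the sitemap:
```js
// docusaurus.config.js (sketch) — the sitemap plugin combines these fields
// into absolute page URLs, even when the site is served from localhost:3000
module.exports = {
  url: 'https://www.algotrader.com',
  baseUrl: '/docs/',
  // ...
};
```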
j
Another shotgun approach could be to run something like ngrok to create a tunnel endpoint for your localhost site, so it's accessible via https on port 443, and then point the docsearch-scraper at that ngrok tunnel endpoint.
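A sketch of what that would look like, assuming the dev site runs on port 3000; the tunnel hostname below is a placeholder generated by ngrok:
```sh
# Open an HTTPS tunnel to the local dev site
ngrok http 3000

# Then point the scraper config's start_urls and sitemap_urls at the
# generated tunnel URL, e.g. https://<random-subdomain>.ngrok.io/docs/
```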