Troubleshooting Typesense Docsearch Scraper Setup Issue

TL;DR: Vinicius ran into issues setting up typesense-docsearch-scraper locally. After checking the .env file, Jason identified a misconfigured Typesense server port and recommended using ngrok or port forwarding for development. Vinicius resolved the issue with port forwarding.

Vinicius
Thu, 22 Jun 2023 20:10:53 UTC

Hi team, thank you all for this amazing development! Due to network restrictions, I'm currently trying to run the typesense-docsearch-scraper locally. So far I've installed typesense-server with WSL Ubuntu and set it up. The server is running and seems to be fine, since curl returns {"ok":true}, but I can't get the scraper to work. I keep getting the same requests.exceptions.JSONDecodeError, raised by virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/requests/models.py.

From my debugging, I know this is triggered by the create_tmp_collection function in typesense_helper.py. When it tries to delete via self.typesense_client.collections[self.collection_name_tmp].delete(), the client can't find a collection with that name. Then, inside /virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py, r.text is [127.0.1.1:8107][E1002]Fail to find method on `/collections' and r.status_code is 404. When it then runs error_message = r.json().get('message', 'API error.') (line 114) on that response, it raises the requests.exceptions.JSONDecodeError in models.py.

From my understanding it should raise the ObjectNotFound exception and then pass through the try block in the helper, right? What am I doing wrong here? Thank you in advance for your attention.
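[Editor's note] The failure mode described above can be reproduced in isolation: the 404 body is plain text, not JSON, so the `r.json()` call in `api_call.py` raises before the client library can translate the status code into `ObjectNotFound`. A minimal sketch (not the scraper's code; the body string is copied from the error text above):

```python
import json

# Plain-text error body returned by the server (quoted in the report above).
# requests' Response.json() is essentially json.loads() over this text.
body = "[127.0.1.1:8107][E1002]Fail to find method on `/collections'"

try:
    json.loads(body)
    raised = None
except json.JSONDecodeError as e:
    raised = str(e)

# The decoder reads "[127.0" as the start of an array of numbers and then
# hits the second "." -- hence "Expecting ',' delimiter: line 1 column 7 (char 6)",
# matching the traceback below.
print(raised)
```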

Vinicius
Thu, 22 Jun 2023 20:11:27 UTC

error msg:

```
Traceback (most recent call last):
  File "Desktop/Projects/typesense-docsearch-scraper/./docsearch", line 5, in <module>
    run()
  File "Desktop/Projects/typesense-docsearch-scraper/cli/src/index.py", line 147, in run
    exit(command.run(sys.argv[2:]))
  File "Desktop/Projects/typesense-docsearch-scraper/cli/src/commands/run_config.py", line 21, in run
    return run_config(args[0])
  File "Desktop/Projects/typesense-docsearch-scraper/cli/../scraper/src/index.py", line 44, in run_config
    typesense_helper.create_tmp_collection()
  File "Desktop/Projects/typesense-docsearch-scraper/cli/../scraper/src/typesense_helper.py", line 32, in create_tmp_collection
    print(self.typesense_client.collections.retrieve())
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/collections.py", line 21, in retrieve
    return self.api_call.get('{0}'.format(Collections.RESOURCE_PATH))
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py", line 138, in get
    return self.make_request(requests.get, endpoint, as_json,
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py", line 130, in make_request
    raise last_exception
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py", line 114, in make_request
    error_message = r.json().get('message', 'API error.')
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting ',' delimiter: line 1 column 7 (char 6)
```

Jason
Thu, 22 Jun 2023 20:12:19 UTC

Could you share the contents of your .env file?

Jason
Thu, 22 Jun 2023 20:12:45 UTC

I suspect the scraper is unable to connect to the Typesense server due to some misconfiguration

Vinicius
Thu, 22 Jun 2023 20:12:58 UTC

```
TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=172.18.182.239
TYPESENSE_PORT=8107
TYPESENSE_PROTOCOL=http
TYPESENSE_PATH=

# WARNING! Please be aware that the scraper sends auth headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!
# If the scraped site is behind the CloudFlare Access.
CF_ACCESS_CLIENT_ID=
CF_ACCESS_CLIENT_SECRET=

# WARNING! Please be aware that the scraper sends auth headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!
# If the scraped site is behind the Google Cloud Identity-Aware Proxy
IAP_AUTH_CLIENT_ID=
IAP_AUTH_SERVICE_ACCOUNT_JSON=

CHROMEDRIVER_PATH=./chrome-driver/chromedriver
```

Jason
Thu, 22 Jun 2023 20:13:58 UTC

Typesense’s default API port is `8108`. Did you specifically intend to change it to `8107`?
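[Editor's note] A quick way to sanity-check this kind of mismatch (a sketch, not part of the scraper; the fallback values are illustrative assumptions) is to build the health-check URL from the same environment variables the scraper's .env provides and hit Typesense's `/health` endpoint, which returns `{"ok": true}` on a healthy server:

```python
import os

# Read the same variables the scraper's .env file defines; the fallbacks
# here are illustrative, with 8108 being Typesense's default API port.
protocol = os.environ.get("TYPESENSE_PROTOCOL", "http")
host = os.environ.get("TYPESENSE_HOST", "localhost")
port = os.environ.get("TYPESENSE_PORT", "8108")

health_url = f"{protocol}://{host}:{port}/health"
print(health_url)
# curl this URL (or requests.get(health_url).json()) and expect {"ok": true};
# a connection error or non-JSON body means host/port/protocol is wrong.
```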

Vinicius
Fri, 23 Jun 2023 13:20:55 UTC

So you were right. I changed my port to 8108 and the crawler is now running. But I'm only getting nb_hits of 1 (a single indexed record), and I'm not sure why. These are my configurations:

Vinicius
Fri, 23 Jun 2023 13:21:15 UTC

```json
{
  "index_name": "sigma-calibration",
  "start_urls": [""],
  "sitemap_urls": [""],
  "stop_urls": ["/tests"],
  "sitemap_alternate_links": true,
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "article h1, header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": ["language", "version", "type", "docusaurus_tag"],
    "attributesToRetrieve": ["hierarchy", "content", "anchor", "url", "url_without_anchor", "type"]
  },
  "conversation_id": ["833762294"],
  "nb_hits": 1
}
```

Vinicius
Fri, 23 Jun 2023 13:25:19 UTC

And this is the debug output I got:

```
DEBUG:typesense.api_call:Making get /aliases/sigma-calibration
DEBUG:typesense.api_call:Try 1 to node 172.18.182.239:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.18.182.239:8108
DEBUG:urllib3.connectionpool: "GET /aliases/sigma-calibration HTTP/1.1" 200 None
DEBUG:typesense.api_call:172.18.182.239:8108 is healthy. Status code: 200
DEBUG:typesense.api_call:Making put /aliases/sigma-calibration
DEBUG:typesense.api_call:Try 1 to node 172.18.182.239:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.18.182.239:8108
DEBUG:urllib3.connectionpool: "PUT /aliases/sigma-calibration HTTP/1.1" 200 None
DEBUG:typesense.api_call:172.18.182.239:8108 is healthy. Status code: 200
DEBUG:typesense.api_call:Making delete /collections/sigma-calibration_1687525730
DEBUG:typesense.api_call:Try 1 to node 172.18.182.239:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.18.182.239:8108
DEBUG:urllib3.connectionpool: "DELETE /collections/sigma-calibration_1687525730 HTTP/1.1" 200 None
DEBUG:typesense.api_call:172.18.182.239:8108 is healthy. Status code: 200
```

Jason
Fri, 23 Jun 2023 19:18:59 UTC

Unfortunately, the scraper doesn’t support scraping sites running on non-standard ports.

Jason
Fri, 23 Jun 2023 19:19:29 UTC

I would recommend running something like ngrok to proxy your local port 3000 to port 443, and then pointing the scraper at the ngrok URL.

Vinicius
Fri, 23 Jun 2023 19:39:56 UTC

Since I'm only running in dev, I got it working with port forwarding. Thanks a lot for your help! Everything seems to be working now.