#community-help

Troubleshooting Typesense Docsearch Scraper Setup Issue

TLDR: Vinicius had trouble setting up typesense-docsearch-scraper locally. After checking the .env file, Jason identified a misconfigured port (8107 instead of the default 8108) on the Typesense server side, and recommended ngrok or port forwarding for scraping a site running on a non-standard port during development. Vinicius resolved the issue with port forwarding.

Solved
Jun 22, 2023 (3 months ago)
Vinicius
08:10 PM
Hi team, thank you all for this amazing development.
Due to network restrictions, I'm currently trying to run the typesense-docsearch-scraper locally. So far I've installed typesense-server with WSL Ubuntu and set it up.
The server is running and seems to be fine, as curl http://localhost:8108/health returns {"ok":true}, but I can't get the scraper to work.
I keep getting the same requests.exceptions.JSONDecodeError, which is being raised by virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/requests/models.py

From my debugging, I know that this is triggered by the create_tmp_collection function in typesense_helper.py.
When it tries to delete in self.typesense_client.collections[self.collection_name_tmp].delete(), the client can't find a collection with that name, and then inside /virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py,
r.text is [127.0.1.1:8107][E1002]Fail to find method on `/collections' and r.status_code is 404.

When it then tries to do error_message = r.json().get('message', 'API error.') (line 114) on that response, it raises the requests.exceptions.JSONDecodeError in models.py.

From my understanding, it should raise an ObjectNotFound exception and then pass through the try in the helper. Right?
What am I doing wrong here?
Thank you in advance for your attention.
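For context, the flow described above in create_tmp_collection is roughly the following sketch (a simplified illustration reusing the names from this thread, not the scraper's exact code):

from typesense.exceptions import ObjectNotFound

def create_tmp_collection(self):
    # Drop any leftover temporary collection from a previous run.
    # A missing collection is expected here, so ObjectNotFound should be
    # caught and ignored; instead, the server's reply is not valid JSON,
    # so the client raises requests.exceptions.JSONDecodeError before it
    # can map the response to a Typesense exception.
    try:
        self.typesense_client.collections[self.collection_name_tmp].delete()
    except ObjectNotFound:
        pass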
Vinicius
08:11 PM
error msg:
Traceback (most recent call last):
File "Desktop/Projects/typesense-docsearch-scraper/./docsearch", line 5, in <module>
run()
File "Desktop/Projects/typesense-docsearch-scraper/cli/src/index.py", line 147, in run
exit(command.run(sys.argv[2:]))
File "Desktop/Projects/typesense-docsearch-scraper/cli/src/commands/run_config.py", line 21, in run
return run_config(args[0])
File "Desktop/Projects/typesense-docsearch-scraper/cli/../scraper/src/index.py", line 44, in run_config
typesense_helper.create_tmp_collection()
File "Desktop/Projects/typesense-docsearch-scraper/cli/../scraper/src/typesense_helper.py", line 32, in create_tmp_collection
print(self.typesense_client.collections.retrieve())
File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/collections.py", line 21, in retrieve
return self.api_call.get('{0}'.format(Collections.RESOURCE_PATH))
File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py",
line 138, in get
return self.make_request(requests.get, endpoint, as_json,
File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py",
line 130, in make_request
raise last_exception
File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py",
line 114, in make_request
error_message = r.json().get('message', 'API error.')
File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/requests/models.py", line 975, in json
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting ',' delimiter: line 1 column 7 (char 6)
Jason
08:12 PM
Could you share the contents of your .env file?
Jason
08:12 PM
I suspect the scraper is unable to connect to the Typesense server due to some misconfiguration
Vinicius
08:12 PM
TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=172.18.182.239
TYPESENSE_PORT=8107
TYPESENSE_PROTOCOL=http
TYPESENSE_PATH=

# WARNING! Please be aware that the scraper sends auth headers to every scraped site, so use allowed_domains to adjust the scope accordingly!
# If the scraped site is behind the CloudFlare Access.
CF_ACCESS_CLIENT_ID=
CF_ACCESS_CLIENT_SECRET=

# WARNING! Please be aware that the scraper sends auth headers to every scraped site, so use allowed_domains to adjust the scope accordingly!
# If the scraped site is behind the Google Cloud Identity-Aware Proxy
IAP_AUTH_CLIENT_ID=
IAP_AUTH_SERVICE_ACCOUNT_JSON=

CHROMEDRIVER_PATH=./chrome-driver/chromedriver
Jason
08:13 PM
Typesense’s default API port is 8108. Did you specifically intend to change it to 8107?
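Given the .env shared above, the corresponding fix is a one-line change (8108 being the default API port Jason mentions):

TYPESENSE_PORT=8108

After that, the earlier health check (curl http://localhost:8108/health) should still return {"ok":true}, and the scraper will be talking to the actual API port instead of 8107.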
Jun 23, 2023 (3 months ago)
Vinicius
01:20 PM
So you were right. I changed my port to 8108 and the crawler is now running. But I'm only getting nb_hits: 1. Not sure why. This is my configuration.
Vinicius
01:21 PM
{
  "index_name": "sigma-calibration",
  "start_urls": [
    ""
  ],
  "sitemap_urls": [
    ""
  ],
  "stop_urls": [
    "/tests"
  ],
  "sitemap_alternate_links": true,
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "article h1, header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 1
}
Vinicius
01:25 PM
And this is the debug output I got:
DEBUG:typesense.api_call:Making get /aliases/sigma-calibration
DEBUG:typesense.api_call:Try 1 to node 172.18.182.239:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.18.182.239:8108
DEBUG:urllib3.connectionpool: "GET /aliases/sigma-calibration HTTP/1.1" 200 None
DEBUG:typesense.api_call:172.18.182.239:8108 is healthy. Status code: 200
DEBUG:typesense.api_call:Making put /aliases/sigma-calibration
DEBUG:typesense.api_call:Try 1 to node 172.18.182.239:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.18.182.239:8108
DEBUG:urllib3.connectionpool: "PUT /aliases/sigma-calibration HTTP/1.1" 200 None
DEBUG:typesense.api_call:172.18.182.239:8108 is healthy. Status code: 200
DEBUG:typesense.api_call:Making delete /collections/sigma-calibration_1687525730
DEBUG:typesense.api_call:Try 1 to node 172.18.182.239:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.18.182.239:8108
DEBUG:urllib3.connectionpool: "DELETE /collections/sigma-calibration_1687525730 HTTP/1.1" 200 None
DEBUG:typesense.api_call:172.18.182.239:8108 is healthy. Status code: 200
Jason
07:18 PM
The scraper doesn’t support scraping sites running on non-standard ports unfortunately.
Jason
07:19 PM
I would recommend running something like ngrok to proxy your local port 3000 to port 443, and then pointing the scraper at the ngrok URL.
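A minimal version of that setup, assuming the docs site runs locally on port 3000 as suggested above (the ngrok hostname below is just a placeholder):

ngrok http 3000

ngrok then prints a public https URL, which would replace the local address in the scraper config, e.g.:

  "start_urls": [
    "https://<your-subdomain>.ngrok.io/"
  ],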
Vinicius
07:39 PM
Since I'm only running in dev, I got it working with port forwarding. Thanks a lot for your help! Everything seems to be working now.
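Vinicius doesn't share his exact setup, but for reference, one common way to do this on WSL is a portproxy rule on the Windows side, for example forwarding the standard HTTP port to a dev server on port 3000 inside WSL (the ports and address here are assumptions based on this thread):

netsh interface portproxy add v4tov4 listenaddress=0.0.0.0 listenport=80 connectaddress=172.18.182.239 connectport=3000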