# community-help
v
Hi team, thank you all for this amazing development. Due to network restrictions, I'm currently trying to run the typesense-docsearch-scraper locally. So far I've installed typesense-server with WSL Ubuntu and set it up. The server is running and seems to be fine, since curl http://localhost:8108/health returns {"ok":true}, but I can't get the scraper to work. I keep getting the same requests.exceptions.JSONDecodeError, raised from virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/requests/models.py
From my debugging, I know this is triggered by the create_tmp_collection function in typesense_helper.py. When it tries to delete with self.typesense_client.collections[self.collection_name_tmp].delete(), the client can't find a collection with that name, and then inside /virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py, r.text is [127.0.1.1:8107][E1002]Fail to find method on `/collections' and r.status_code is 404. When line 114 then runs error_message = r.json().get('message', 'API error.') on that response, it raises the requests.exceptions.JSONDecodeError in models.py
From my understanding it should raise the ObjectNotFound exception instead, and then pass through the try block in the helper. Right? What am I doing wrong here? Thank you in advance for your attention
error msg:
Copy code
Traceback (most recent call last):
  File "Desktop/Projects/typesense-docsearch-scraper/./docsearch", line 5, in <module>
    run()
  File "Desktop/Projects/typesense-docsearch-scraper/cli/src/index.py", line 147, in run
    exit(command.run(sys.argv[2:]))
  File "Desktop/Projects/typesense-docsearch-scraper/cli/src/commands/run_config.py", line 21, in run
    return run_config(args[0])
  File "Desktop/Projects/typesense-docsearch-scraper/cli/../scraper/src/index.py", line 44, in run_config
    typesense_helper.create_tmp_collection()
  File "Desktop/Projects/typesense-docsearch-scraper/cli/../scraper/src/typesense_helper.py", line 32, in create_tmp_collection
    print(self.typesense_client.collections.retrieve())
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/collections.py", line 21, in retrieve
    return self.api_call.get('{0}'.format(Collections.RESOURCE_PATH))
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py", line 138, in get
    return self.make_request(requests.get, endpoint, as_json,
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py", line 130, in make_request
    raise last_exception
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/typesense/api_call.py", line 114, in make_request
    error_message = r.json().get('message', 'API error.')
  File "/home/vinicius/.local/share/virtualenvs/typesense-docsearch-scraper-RhF6cRUK/lib/python3.10/site-packages/requests/models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting ',' delimiter: line 1 column 7 (char 6)
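For context, the failure mode can be reproduced without the scraper. The body returned on the wrong port is plain text, not JSON, so the `r.json().get('message', 'API error.')` call in api_call.py (approximated here with `json.loads`) raises before the 404 can ever be translated into an ObjectNotFound exception. A minimal sketch:

```python
import json

# The server replied with a plain-text error body (from the traceback above):
body = "[127.0.1.1:8107][E1002]Fail to find method on `/collections'"

# api_call.py effectively does r.json().get('message', 'API error.').
# Parsing this non-JSON body raises the exact error seen above:
# "Expecting ',' delimiter: line 1 column 7 (char 6)".
try:
    message = json.loads(body).get("message", "API error.")
except json.JSONDecodeError:
    # Defensive fallback: keep the raw text when the body isn't JSON.
    message = body

print(message)
```

This is only an illustration of the mechanics, not a proposed patch to the library.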
j
Could you share the contents of your .env file?
I suspect the scraper is unable to connect to the Typesense server due to some misconfiguration
v
Copy code
TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=172.18.182.239
TYPESENSE_PORT=8107
TYPESENSE_PROTOCOL=http
TYPESENSE_PATH=

# WARNING! Please be aware that the scraper sends auth headers to every scraped site, so use allowed_domains to adjust the scope accordingly!
# If the scraped site is behind the CloudFlare Access.
CF_ACCESS_CLIENT_ID=
CF_ACCESS_CLIENT_SECRET=

# WARNING! Please be aware that the scraper sends auth headers to every scraped site, so use allowed_domains to adjust the scope accordingly!
# If the scraped site is behind the Google Cloud Identity-Aware Proxy
IAP_AUTH_CLIENT_ID=
IAP_AUTH_SERVICE_ACCOUNT_JSON=

CHROMEDRIVER_PATH=./chrome-driver/chromedriver
j
Typesense’s default API port is `8108`. Did you specifically intend to change it to `8107`?
v
So you were right. I changed my port to 8108 and the crawler is now running. But I'm only getting nb_hits: 1, and I'm not sure why. These are my configurations:
Copy code
{
  "index_name": "sigma-calibration",
  "start_urls": [
    "http://192.168.0.25:3000/"
  ],
  "sitemap_urls": [
    "http://192.168.0.25:3000/sitemap.xml"
  ],
  "stop_urls": [
    "/tests"
  ],
  "sitemap_alternate_links": true,
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "article h1, header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 1
}
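A side note on the config above: to my understanding, stop_urls entries are treated as regex patterns matched against each crawled URL, so "/tests" excludes that whole subtree. A rough illustration of that matching (my own sketch under that assumption, not the scraper's code):

```python
import re

# Assumption: stop_urls entries are regex patterns matched anywhere in
# the URL; any match means the page is skipped during the crawl.
stop_urls = ["/tests"]

def is_stopped(url: str) -> bool:
    return any(re.search(pattern, url) for pattern in stop_urls)

print(is_stopped("http://192.168.0.25:3000/tests/unit"))   # True
print(is_stopped("http://192.168.0.25:3000/docs/intro"))   # False
```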
And this is the debug I got
Copy code
DEBUG:typesense.api_call:Making get /aliases/sigma-calibration
DEBUG:typesense.api_call:Try 1 to node 172.18.182.239:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.18.182.239:8108
DEBUG:urllib3.connectionpool:http://172.18.182.239:8108 "GET /aliases/sigma-calibration HTTP/1.1" 200 None
DEBUG:typesense.api_call:172.18.182.239:8108 is healthy. Status code: 200
DEBUG:typesense.api_call:Making put /aliases/sigma-calibration
DEBUG:typesense.api_call:Try 1 to node 172.18.182.239:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.18.182.239:8108
DEBUG:urllib3.connectionpool:http://172.18.182.239:8108 "PUT /aliases/sigma-calibration HTTP/1.1" 200 None
DEBUG:typesense.api_call:172.18.182.239:8108 is healthy. Status code: 200
DEBUG:typesense.api_call:Making delete /collections/sigma-calibration_1687525730
DEBUG:typesense.api_call:Try 1 to node 172.18.182.239:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.18.182.239:8108
DEBUG:urllib3.connectionpool:http://172.18.182.239:8108 "DELETE /collections/sigma-calibration_1687525730 HTTP/1.1" 200 None
DEBUG:typesense.api_call:172.18.182.239:8108 is healthy. Status code: 200
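The log reflects the scraper's swap pattern: index into a timestamped temporary collection, point the alias at it, then delete the previously aliased collection. A minimal in-memory sketch of that GET/PUT/DELETE sequence (FakeTypesense is hypothetical, just to mirror the flow; the real scraper uses the typesense client against a live server):

```python
# Hypothetical in-memory stand-in for the alias-swap flow in the log:
# GET /aliases/<alias>, PUT /aliases/<alias>, DELETE /collections/<old>.
class FakeTypesense:
    def __init__(self):
        self.collections = set()
        self.aliases = {}  # alias name -> collection name

def swap_collection(ts: FakeTypesense, alias: str, new_collection: str) -> None:
    ts.collections.add(new_collection)
    old = ts.aliases.get(alias)          # GET /aliases/<alias>
    ts.aliases[alias] = new_collection   # PUT /aliases/<alias>
    if old and old in ts.collections:
        ts.collections.discard(old)      # DELETE /collections/<old>

ts = FakeTypesense()
swap_collection(ts, "sigma-calibration", "sigma-calibration_1687525730")
swap_collection(ts, "sigma-calibration", "sigma-calibration_1687525731")
print(ts.aliases["sigma-calibration"])  # sigma-calibration_1687525731
print(ts.collections)                   # only the newest collection remains
```

This swap is what makes reindexing atomic from the search client's point of view: readers always query the alias, never a timestamped collection directly.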
j
The scraper doesn’t support scraping sites running on non-standard ports, unfortunately.
I would recommend running something like ngrok to proxy your local port 3000 to port 443, and then pointing the scraper at the ngrok URL.
v
Since I'm only running on dev, I got it working with port forwarding. Thanks a lot for your help! Everything seems to be working now
👍 1