Trouble with DocSearch Scraper and Pipenv Across Multiple OSs
TLDR: James ran into errors when trying to build the Typesense DocSearch Scraper from scratch, and believed the cause was a bad Pipfile.lock. Jason attempted to replicate the error and spent hours isolating the issue; he ultimately fixed the problem and copied his bash history for future reference. The conversation also touches briefly on using a virtual machine for testing.
Feb 10, 2023 (10 months ago)
Jason 02:32 AM
James 02:36 AM
Can you edit my PR here to add the missing steps?
https://github.com/typesense/typesense-docsearch-scraper/pull/23/files
It's unclear from my side what those commands are doing.
Jason 02:36 AM
James 02:39 AM
james@2wsx:~/typesense-docsearch-scraper$ pipenv shell
Loading .env environment variables...
Loading .env environment variables...
Launching subshell in virtual environment...
. /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/activate
james@2wsx:~/typesense-docsearch-scraper$ . /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/activate
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$ ./docsearch docker:build
/home/james/typesense-docsearch-scraper/cli/src/commands/run_tests.py:22: SyntaxWarning: "is" with a literal. Did you mean "=="?
if args[1] is "no_browser":
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "": dial unix /var/run/docker.sock: connect: permission denied
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
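As an aside, the SyntaxWarning above points at a real (if latent) bug: `is` compares object identity, not string contents, so `args[1] is "no_browser"` only works when CPython happens to intern both strings. A standalone sketch of the difference (not the scraper's code):

```python
# `is` checks object identity; `==` checks value equality.
a = "no_browser"
b = "".join(["no_", "browser"])  # same text, built at runtime -> a distinct object

assert a == b        # value comparison: True
assert a is not b    # identity comparison: different objects

# The fix the warning suggests for run_tests.py-style checks:
def is_no_browser(arg):
    return arg == "no_browser"

assert is_no_browser("no_browser")
```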
Again, this is a fresh Ubuntu VM, and if it matters, I followed the normal steps to install Docker according to the official Docker documentation here:
https://docs.docker.com/engine/install/ubuntu/
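The `permission denied ... /var/run/docker.sock` error is the standard rootless-access issue: the daemon socket is owned by root:docker, and the installing user is not in the `docker` group yet. Docker's post-install steps for Linux (shown here as a sketch; they need a `newgrp` or a re-login to take effect) are:

```shell
sudo groupadd docker              # group usually exists already; harmless if so
sudo usermod -aG docker "$USER"   # grant the current user access to the socket
newgrp docker                     # start a shell with the new group (or log out and back in)
docker run hello-world            # should now work without sudo
```

Running the build via `sudo` instead is not equivalent: `sudo` resets PATH, so the pipenv virtualenv's `python` disappears, which is exactly the `/usr/bin/env: 'python': No such file or directory` failure seen next in the thread.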
Jason 02:40 AM
Jason 02:40 AM
James 02:40 AM
James 02:41 AM
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$ sudo ./docsearch docker:build
[sudo] password for james:
/usr/bin/env: 'python': No such file or directory
Jason 02:41 AM
Jason 02:41 AM
Jason 02:41 AM
sudo docker run hello-world
work?
James 02:41 AM
James 02:41 AM
james@2wsx:~/typesense-docsearch-scraper$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
2db29710123e: Pull complete
Digest: sha256:aa0cc8055b82dc2509bed2e19b275c8f463506616377219d9642221ab53cf9fe
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
(amd64)
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal.
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
Jason 02:42 AM
Jason 02:43 AM
Jason 02:43 AM
James 02:43 AM
Jason 02:44 AM
James 02:44 AM
James 02:52 AM
James 02:53 AM
./docsearch docker:build
=> ERROR [13/26] RUN apt-get update -y && apt-get install -yq google-chrome-stable=99.0.4844.51-1 unzip 2.5s
------
> [13/26] RUN apt-get update -y && apt-get install -yq google-chrome-stable=99.0.4844.51-1 unzip:
#0 0.362 Get:1 http://dl.google.com/linux/chrome/deb stable InRelease [1811 B]
#0 0.444 Hit:2 https://deb.nodesource.com/node_8.x bionic InRelease
#0 0.450 Hit:3 http://security.ubuntu.com/ubuntu bionic-security InRelease
#0 0.462 Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
#0 0.481 Get:5 http://dl.google.com/linux/chrome/deb stable/main amd64 Packages [1061 B]
#0 0.552 Hit:6 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
#0 0.565 Hit:7 bionic InRelease
#0 0.639 Hit:8 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
#0 0.732 Fetched 2872 B in 0s (6409 B/s)
#0 0.732 Reading package lists...
#0 1.713 Reading package lists...
#0 2.353 Building dependency tree...
#0 2.482 Reading state information...
#0 2.495 E: Version '99.0.4844.51-1' for 'google-chrome-stable' was not found
------
Dockerfile.base:37
--------------------
36 | RUN echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
37 | >>> RUN apt-get update -y && apt-get install -yq \
38 | >>> google-chrome-stable=99.0.4844.51-1 \
39 | >>> unzip
40 | RUN wget -q https://chromedriver.storage.googleapis.com/99.0.4844.51/chromedriver_linux64.zip
--------------------
ERROR: failed to solve: process "/bin/sh -c apt-get update -y && apt-get install -yq google-chrome-stable=99.0.4844.51-1 unzip" did not complete successfully: exit code: 100
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
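The root cause is the version pin: Google's apt repo only ever serves the current `google-chrome-stable` build, so a Dockerfile that pins `=99.0.4844.51-1` breaks as soon as Chrome updates. A more resilient sketch (an assumption, not the repo's actual fix; the `LATEST_RELEASE_<major>` ChromeDriver endpoint shown here only exists for Chrome versions up to 114):

```dockerfile
# Install whatever stable Chrome is currently in Google's apt repo,
# instead of pinning a build that will eventually rotate out.
RUN apt-get update -y && apt-get install -yq google-chrome-stable unzip

# Fetch the ChromeDriver release matching the installed Chrome major version.
RUN MAJOR=$(google-chrome-stable --version | grep -oE '[0-9]+' | head -1) \
 && DRIVER=$(wget -qO- "https://chromedriver.storage.googleapis.com/LATEST_RELEASE_${MAJOR}") \
 && wget -q "https://chromedriver.storage.googleapis.com/${DRIVER}/chromedriver_linux64.zip" \
 && unzip chromedriver_linux64.zip
```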
Jason 05:52 AM
Jason 05:52 AM
Jason 05:52 AM
James 06:13 AM
Jason 05:43 PM
James 07:26 PM
Jason 07:27 PM
James 07:38 PM
2023-02-10 19:38:18 [scrapy.core.scraper] ERROR: Spider error processing <GET https://isaacscript.github.io/> (referer: None)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 180, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 65, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 67, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
2023-02-10 19:38:19 [scrapy.core.scraper] ERROR: Spider error processing <GET > (referer: https://isaacscript.github.io/sitemap.xml)
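The traceback pinpoints the bug: `typesense_helper.py` runs `json.loads(json.loads(r))` over the import results, which assumes every item is a doubly-encoded JSON string, but the typesense client version in this environment already returns dicts. A version-tolerant normalization (a sketch inferred from the traceback, not the scraper's actual code) would unwrap strings until a non-string remains:

```python
import json

def iter_import_results(results):
    """Yield import results as dicts, whatever encoding the client used.

    Depending on the typesense-python version, each item may already be a
    dict, a JSON string like '{"success":true}', or a doubly-encoded string
    like '"{\\"success\\":true}"'.
    """
    for r in results:
        while isinstance(r, str):  # peel off however many encoding layers exist
            r = json.loads(r)
        yield r

# Collect the failed items the way the scraper intends to:
mixed = [{'success': True}, '{"success": false}', '"{\\"success\\":true}"']
failed = [r for r in iter_import_results(mixed) if r['success'] is False]
# failed == [{'success': False}]
```

This is also consistent with Jason's suggestion later in the thread to pin `typesense = "==0.10.0"`: the pinned client returns strings, which the existing double `json.loads` expects.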
Jason 08:02 PM
James 08:02 PM
Jason 08:02 PM
Jason 08:02 PM
James 08:03 PM
'symbols_to_index': '_',
I did that, and then I got an error complaining that "symbols_to_index" needed to be an array, so I assume that you just made a typo.
Then, I updated it to be this:
James 08:04 PM
'symbols_to_index': ['_'],
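For context, in a typesense-docsearch-scraper config these options live under `custom_settings` and take arrays; a minimal hypothetical excerpt (the index name and URL are illustrative, not James's actual config):

```json
{
  "index_name": "isaacscript",
  "start_urls": ["https://isaacscript.github.io/"],
  "custom_settings": {
    "symbols_to_index": ["_"]
  }
}
```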
James 08:04 PM
Jason 08:28 PM
print(result)
right before this line and share that output?
James 08:49 PM
08:49 PM[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
Jason 08:49 PM
James 08:49 PM
[{'success': True}]
James 08:50 PM
James 08:50 PM
James 08:50 PM
james@2wsx:~/crawler$ ./run.sh
[{'success': True}]
2023-02-10 20:48:58 [scrapy.core.scraper] ERROR: Spider error processing <GET https://isaacscript.github.io/> (referer: None)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 180, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:48:58 [scrapy.core.scraper] ERROR: Spider error processing <GET > (referer: https://isaacscript.github.io/sitemap.xml)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 172, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
[the same "[{'success': True}, ...]" output and traceback repeat for each sitemap URL; duplicate repetitions trimmed]
Jason 09:30 PM
Jason 09:30 PM
Pipfile
typesense = "==0.10.0"
Jason 09:30 PM
pipenv install
and run the scraper again
James 09:34 PM
Jason 09:38 PM
James 09:40 PM
Jason 09:41 PM
James 09:42 PM
Jason 09:42 PM
James 09:42 PM
https://github.com/typesense/typesense-docsearch-scraper/pull/3
James 09:43 PM
It seems to be hanging.
Jason 09:43 PM
Jason 09:43 PM
James 09:43 PM
09:43 PMDEBUG:typesense.api_call:Making post /collections/isaacscript_1676064791/documents/import
DEBUG:typesense.api_call:Try 1 to node -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1):
DEBUG:urllib3.connectionpool: "POST /collections/isaacscript_1676064791/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call: is healthy. Status code: 200
['"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"']
[the same import POST, health check, and '"{\"success\":true}"' output repeat for the next batch; duplicate repetition trimmed]
James 09:44 PM
Jason 09:45 PM
Jason 09:47 PM
James 09:48 PM
Jason 09:48 PM
Jason 09:48 PM
James 09:51 PM
James 09:59 PM
James 09:59 PM
Jason 10:20 PM
symbols_to_index: ['_'], could you change that to token_separators: ['_'] and then rerun the scraper?
James 10:21 PM
James 10:33 PM
James 10:33 PM
Jason 10:41 PM
James 11:23 PM
For example, this page: https://isaacscript.github.io/isaacscript-common/other/enums/ModCallbackCustom
I suspect that it is because there are too many elements on the page.
In Algolia land, there is a really nice GUI that tells you the specific pages that had 404s or otherwise had errors.
Is there a way to get that kind of functionality from the Typesense crawler?
Are there any blogs you can point me towards that explain how to start troubleshooting this kind of thing?
Jason 11:24 PM
Jason 11:25 PM
Too much hits, DocSearch only handle
Jason 11:25 PM
James 11:25 PM
11:25 PMI don't have any results for "Too much hits".
Anything more specific that I should be looking for?
James 11:26 PM
Jason 11:26 PM
Jason 11:27 PM
Jason 11:28 PM
Jason 11:29 PM
James 11:33 PM
James 11:33 PM
Jason 11:34 PM
Jason 11:35 PM
Jason 11:36 PM
James 11:37 PM
Jason 11:37 PM
James 11:37 PM
Jason 11:38 PM
James 11:38 PM
typesense-website, or if you wanted to take care of it.
Jason 11:38 PM
James 11:45 PM
James 11:45 PM
Jason 11:46 PM
James 11:46 PM
James 11:46 PM
Jason 11:47 PM
Feb 11, 2023 (10 months ago)