#community-help

Trouble with DocSearch Scraper and Pipenv Across Multiple OSs

TL;DR: James ran into errors while building the Typesense DocSearch Scraper from scratch, which he believes stem from a bad Pipfile.lock. Jason attempted to replicate the error, spent hours isolating the issue, ultimately fixed it, and saved his bash history for future reference. The conversation also touches briefly on using a virtual machine for testing.

Feb 10, 2023 (8 months ago)
Jason
02:32 AM
ok done editing
James
02:36 AM
That worked.
Can you edit my PR here to add the missing steps?
https://github.com/typesense/typesense-docsearch-scraper/pull/23/files
It's unclear from my side what those commands are doing.

Jason
02:36 AM
Sure will do
James
02:39 AM
Now, I'm getting a new error:
james@2wsx:~/typesense-docsearch-scraper$ pipenv shell
Loading .env environment variables...
Loading .env environment variables...
Launching subshell in virtual environment...
 . /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/activate
james@2wsx:~/typesense-docsearch-scraper$  . /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/activate
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$ ./docsearch docker:build
/home/james/typesense-docsearch-scraper/cli/src/commands/run_tests.py:22: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if args[1] is "no_browser":
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "": dial unix /var/run/docker.sock: connect: permission denied
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$

Again, this is a fresh Ubuntu VM, and if it matters, I followed the normal steps to install Docker according to the official Docker documentation here:
https://docs.docker.com/engine/install/ubuntu/
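As an aside, the SyntaxWarning in that output flags a genuine bug in `run_tests.py`: `is` compares object identity, not value. A minimal illustration (function names here are illustrative, not from the scraper):

```python
def is_no_browser_buggy(arg):
    # The check as written in the scraper: `is` compares object identity,
    # so this relies on string interning (a CPython implementation detail)
    # and can silently return False for an equal string.
    return arg is "no_browser"

def is_no_browser_fixed(arg):
    # Value equality is what was intended.
    return arg == "no_browser"

# A string assembled at runtime is equal to the literal, but not
# guaranteed to be the same object.
arg = "".join(["no_", "browser"])
print(is_no_browser_fixed(arg))  # True
```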
Jason
02:40 AM
Docker usually needs to be run as root on Linux
02:40
Jason
02:40 AM
Also I’m not sure if the docker daemon autostarts after install
James
02:40 AM
Won't running as root mess up all of the Python virtual environment stuff that we've been carefully setting up over the past few hours?
02:41
James
02:41 AM
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$ sudo ./docsearch docker:build
[sudo] password for james:
/usr/bin/env: 'python': No such file or directory
Jason
02:41 AM
I meant the docker daemon
02:41
Jason
02:41 AM
Could you check if it’s running
02:41
Jason
02:41 AM
Does sudo docker run hello-world work?
James
02:41 AM
Yes:
02:41
James
02:41 AM
james@2wsx:~/typesense-docsearch-scraper$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
2db29710123e: Pull complete
Digest: sha256:aa0cc8055b82dc2509bed2e19b275c8f463506616377219d9642221ab53cf9fe
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 
02:43
Jason
02:43 AM
Hmm
02:43
Jason
02:43 AM
Although in your case the hello world container works
James
02:43 AM
Well I'm running the hello world container as sudo.
Jason
02:44 AM
Oh right
James
02:44 AM
I will try following this guide.

02:52
James
02:52 AM
Ok, that worked. I updated the pull request again.
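For reference, the standard fix from Docker's Linux post-installation documentation (presumably what the guide covered) is to add your user to the `docker` group so the daemon socket is accessible without sudo:

```shell
# Docker's documented post-install steps: let a non-root user talk to
# the daemon socket at /var/run/docker.sock.
sudo groupadd docker              # group usually exists already; that error is harmless
sudo usermod -aG docker "$USER"   # add the current user to the docker group
newgrp docker                     # pick up the new group in this shell (or log out/in)
docker run hello-world            # should now work without sudo
```

This keeps the Python virtualenv setup untouched, since only group membership changes.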
02:53
James
02:53 AM
Now I'm getting a new error after running ./docsearch docker:build:
 => ERROR [13/26] RUN apt-get update -y && apt-get install -yq   google-chrome-stable=99.0.4844.51-1   unzip                                                            2.5s
------
 > [13/26] RUN apt-get update -y && apt-get install -yq   google-chrome-stable=99.0.4844.51-1   unzip:
#0 0.362 Get:1 http://dl.google.com/linux/chrome/deb stable InRelease [1811 B]
#0 0.444 Hit:2 https://deb.nodesource.com/node_8.x bionic InRelease
#0 0.450 Hit:3 http://security.ubuntu.com/ubuntu bionic-security InRelease
#0 0.462 Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
#0 0.481 Get:5 http://dl.google.com/linux/chrome/deb stable/main amd64 Packages [1061 B]
#0 0.552 Hit:6 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
#0 0.565 Hit:7  bionic InRelease
#0 0.639 Hit:8 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
#0 0.732 Fetched 2872 B in 0s (6409 B/s)
#0 0.732 Reading package lists...
#0 1.713 Reading package lists...
#0 2.353 Building dependency tree...
#0 2.482 Reading state information...
#0 2.495 E: Version '99.0.4844.51-1' for 'google-chrome-stable' was not found
------
Dockerfile.base:37
--------------------
  36 |     RUN echo "deb [arch=amd64]  http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
  37 | >>> RUN apt-get update -y && apt-get install -yq \
  38 | >>>   google-chrome-stable=99.0.4844.51-1 \
  39 | >>>   unzip
  40 |     RUN wget -q https://chromedriver.storage.googleapis.com/99.0.4844.51/chromedriver_linux64.zip
--------------------
ERROR: failed to solve: process "/bin/sh -c apt-get update -y && apt-get install -yq   google-chrome-stable=99.0.4844.51-1   unzip" did not complete successfully: exit code: 100
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
Jason
05:52 AM
I just remembered another PR that fell off my radar… I think this actually will fix a lot of these issues: https://github.com/typesense/typesense-docsearch-scraper/pull/16/files
05:52
Jason
05:52 AM
Could you check out that branch and try running build from there?
05:52
Jason
05:52 AM
If that works, I can merge that PR in
James
06:13 AM
I tried it, and I get the same error relating to Google Chrome.
Jason
05:43 PM
Could you try changing it to the latest version of Chrome?
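For context on why this error appears: Google's apt repository only serves the current stable Chrome build, so any exact version pin eventually stops resolving. The fix amounts to updating (or dropping) the pin in `Dockerfile.base`, roughly:

```dockerfile
# Google's repo only carries the current stable release, so install it
# unpinned rather than pinning a build that has rotated out.
RUN apt-get update -y && apt-get install -yq \
  google-chrome-stable \
  unzip
```

The chromedriver download a few lines further down would then need to match whatever Chrome version gets installed.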
James
07:26 PM
That worked. Do you want me to add that to my PR?
Jason
07:27 PM
Yeah, that would be great
James
07:38 PM
Ok, when running the crawler, I get a new error:
2023-02-10 19:38:18 [scrapy.core.scraper] ERROR: Spider error processing <GET https://isaacscript.github.io/> (referer: None)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 180, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 65, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 67, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
2023-02-10 19:38:19 [scrapy.core.scraper] ERROR: Spider error processing <GET > (referer: https://isaacscript.github.io/sitemap.xml)
Jason
08:02 PM
Could you share the changes you made to that file?
James
08:02 PM
To what file?
Jason
08:02 PM
> src/typesense_helper.py
08:02
Jason
08:02 PM
Or you didn’t make any changes?
James
08:03 PM
Well, you told me to add this:
'symbols_to_index': '_',

I did that, and then I got an error complaining that "symbols_to_index" needed to be an array, so I assume that you just made a typo.
Then, I updated it to be this:
08:04
James
08:04 PM
'symbols_to_index': ['_'],

08:04
James
08:04 PM
That got me past the error, and it started crawling my website, but then I got the error that I pasted above.
Jason
08:28 PM
Could you add print(result) right before this line and share that output?
James
08:49 PM
[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
Jason
08:49 PM
Is that just before it errors out?
James
08:49 PM
Well, the first one is
[{'success': True}]
08:50
James
08:50 PM
But it generates the longer one after each error.
08:50
James
08:50 PM
e.g.
08:50
James
08:50 PM
james@2wsx:~/crawler$ ./run.sh
[{'success': True}]
2023-02-10 20:48:58 [scrapy.core.scraper] ERROR: Spider error processing <GET https://isaacscript.github.io/> (referer: None)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 180, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:48:58 [scrapy.core.scraper] ERROR: Spider error processing <GET > (referer: https://isaacscript.github.io/sitemap.xml)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 172, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:48:59 [scrapy.core.scraper] ERROR: Spider error processing <GET > (referer: https://isaacscript.github.io/sitemap.xml)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 172, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
^C[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:49:00 [scrapy.core.scraper] ERROR: Spider error processing <GET > (referer: https://isaacscript.github.io/sitemap.xml)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 172, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
^C[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:49:00 [scrapy.core.scraper] ERROR: Spider error processing <GET > (referer: https://isaacscript.github.io/sitemap.xml)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 172, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
^C^C[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:49:01 [scrapy.core.scraper] ERROR: Spider error processing <GET > (referer: https://isaacscript.github.io/sitemap.xml)
Jason
09:30 PM
Looks like the typesense-python version in the lockfile is an older version with slightly different behavior, which is what’s causing this issue.
09:30
Jason
09:30 PM
Could you lock the typesense package to this version in the Pipfile:

typesense = "==0.10.0"
09:30
Jason
09:30 PM
Then run pipenv install and run the scraper again
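For background, the incompatibility appears to be that the helper's nested `json.loads` expects each import result to be a double-encoded JSON string, while other typesense-python versions return plain dicts. A version-tolerant sketch (`parse_import_result` is a hypothetical helper, not part of the scraper):

```python
import json

def parse_import_result(r):
    # Hypothetical helper: normalize one import result regardless of
    # typesense-python version. Depending on the version, `r` may be a
    # dict, a JSON string, or a double-encoded JSON string.
    while isinstance(r, str):
        r = json.loads(r)
    return r

results = [{'success': True}, '"{\\"success\\":true}"']
failed = [r for r in results if parse_import_result(r)['success'] is False]
print(failed)  # []
```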
James
09:34 PM
It seems to be scraping now without any errors, thanks. Do you want me to add that to the PR?

Jason
09:38 PM
Yeah, that would be great
Jason
09:41 PM
Is that what you’re using now to run the scraper?
James
09:42 PM
Indeed, as it was the only way to build it, as we discussed yesterday.
Jason
09:42 PM
Got it, in that case, I’ll make the change to the typesense version in the Pipfile in that PR
James
09:42 PM
Also I noticed that you might want to close this one, as the author seems MIA:
https://github.com/typesense/typesense-docsearch-scraper/pull/3
09:43
James
09:43 PM
When the crawler finishes, is it supposed to return me to my shell?
It seems to be hanging.
Jason
09:43 PM
It should yeah
09:43
Jason
09:43 PM
What are the last say 20 lines?
James
09:43 PM
DEBUG:typesense.api_call:Making post /collections/isaacscript_1676064791/documents/import
DEBUG:typesense.api_call:Try 1 to node  -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): 
DEBUG:urllib3.connectionpool: "POST /collections/isaacscript_1676064791/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call: is healthy. Status code: 200
['"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"']
DEBUG:typesense.api_call:Making post /collections/isaacscript_1676064791/documents/import
DEBUG:typesense.api_call:Try 1 to node  -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): 
DEBUG:urllib3.connectionpool: "POST /collections/isaacscript_1676064791/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call: is healthy. Status code: 200
['"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"']
09:44
James
09:44 PM
If I had to guess, it just finished going through all the URLs, but maybe its choking on some final processing, or something.
Jason
09:45 PM
Hmm, I’ve never seen it hang like this… Maybe try restarting?
09:47
Jason
09:47 PM
Btw, you had to make a Chrome version upgrade, right? Since you’re updating code anyway, could you also add the typesense pinning to the Pipfile now?
James
09:48 PM
I already did.
Jason
09:48 PM
I’ve merged in the PR that you’re using
09:48
Jason
09:48 PM
After merging that, there’s now a conflict in your PR. Could you resolve that?
James
09:51 PM
Just did.

09:59
James
09:59 PM
Ok, I ran it again, and the second time it completed successfully.
09:59
James
09:59 PM
However, when I search for "TEAR_FALLING_ACCELERATION" on my website, it still shows up as 0 results, so it looks like the modification didn't accomplish anything.
Jason
10:20 PM
Hmm, ok let’s try one more thing. Instead of symbols_to_index: ['_'], could you change that to token_separators: ['_'] and then rerun the scraper?
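For context, as I understand the scraper config, both of those keys live under `custom_settings` in the scraper's JSON config and are passed through to the Typesense collection schema; a sketch (the `index_name` here is illustrative):

```json
{
  "index_name": "isaacscript",
  "custom_settings": {
    "token_separators": ["_"]
  }
}
```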
James
10:21 PM
Actually, I realized something. As a hack, I tried running Prettier on the Markdown, which seems to remove the escape characters automatically, so I'll add that to my build pipeline, and see if it fixes the underlying problem.
10:33
James
10:33 PM
Yeah, it looks like using Prettier fixes things:
10:33
James
10:33 PM
(screenshot: search results)
Jason
10:41 PM
Yaaay!
James
11:23 PM
Some of my webpages are not showing up in the search.
For example, this page: https://isaacscript.github.io/isaacscript-common/other/enums/ModCallbackCustom
I suspect that it is because there are too many elements on the page.
In Algolia land, there is a really nice GUI that tells you the specific pages that had 404s or otherwise had errors.
Is there a way to get that kind of functionality from the Typesense crawler?
Are there any blogs you can point me towards that explain how to start troubleshooting this kind of thing?
Jason
11:24 PM
You want to look at the scraper logs… Looks like you already have DEBUG logs turned on based on what you shared earlier
11:25
Jason
11:25 PM
Search for Too much hits, DocSearch only handle
11:25
Jason
11:25 PM
If you search for that in the scraper codebase, you can change that value
James
11:25 PM
I did save the output, but I don't get any results for "error", and all the entries relating to "ModCallbackCustom" look to be normal messages indicating that the page was properly ingested.
I don't have any results for "Too much hits".
Anything more specific that I should be looking for?
11:26
James
11:26 PM
I can throw the output on pastebin if that is helpful.
Jason
11:26 PM
Hmm, then maybe the issue is something else
11:27
Jason
11:27 PM
If the issue is about too many elements on the page that’s the error you’ll see, per the code: https://github.com/typesense/typesense-docsearch-scraper/blob/37334bbcea17df8eedeeb82200815a4fe8e02759/scraper/src/documentation_spider.py#L154
11:28
Jason
11:28 PM
Ah, found the issue
11:29
Jason
11:29 PM
In your scraper config ^
James
11:33 PM
That worked, great.
11:33
James
11:33 PM
Should this be documented somewhere? I feel like the Typesense docs should tell you to do that.
Jason
11:34 PM
It’s very Docusaurus-specific… Let me see if there’s a better Docusaurus config that we can just link to
11:35
Jason
11:35 PM
Looks like Docusaurus’s doc site has since moved to Algolia’s proprietary crawler for their search
11:36
Jason
11:36 PM
So yeah would be good to call this out as one of the bullet points you added
James
11:37 PM
Do you want me to do another PR?
Jason
11:37 PM
Yeah that’s on a different repo
James
11:37 PM
Aye, I already did a PR to typesense-website yesterday.
Jason
11:38 PM
I already merged that in
James
11:38 PM
Right, I was asking if you wanted me to do another PR to typesense-website, or if you wanted to take care of it.
Jason
11:38 PM
Ah, it would be great if you can do another PR
James
11:45 PM
Ok, I'm starting on it now.
11:45
James
11:45 PM
I have a question about pricing. If I pay for cloud-hosted Typesense from you guys, would you also automatically index it in the cloud as well?
Jason
11:46 PM
No, we only host the Typesense Cloud cluster. The scraper is something you’d host on your side (typically in your CI pipeline, you’d trigger the scraper to run post deploy of your docs site)
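A sketch of that setup: after the docs deploy step in CI, run the published scraper image. This follows the documented invocation pattern; the `.env` and `config.json` contents are your Typesense credentials and scraper config:

```shell
# Trigger the scraper after a docs deploy, e.g. as the last CI step.
# .env holds TYPESENSE_* credentials; config.json is the scraper config.
docker run \
  --env-file=.env \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  typesense/docsearch-scraper
```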
James
11:46 PM
Oh, that makes perfect sense, thank you.

11:46
James
11:46 PM
I'll add that to the PR as well.
Jason
11:47 PM
I think there’s a tip section in there already that might mention this… about triggering it from your CI pipeline
Feb 11, 2023 (8 months ago)