# community-help
j
Hello. I am transitioning my Docusaurus website from Algolia to Typesense: https://isaacscript.github.io/ For my crawler, I am using the default config that is recommended in the official Typesense documentation: https://github.com/algolia/docsearch-configs/blob/master/configs/docusaurus-2.json My search is currently "working" insofar as some things appear to be searchable. However, it seems that the crawler did not index some of the words in level 3 headers. For example: https://isaacscript.github.io/isaacscript-common/other/enums/StatType#tear_falling_acceleration Searching for
TEAR_FALLING_ACCELERATION
results in:
No results for "TEAR_FALLING_ACCELERATION"
Is there something else that I forgot to do for a Docusaurus website?
j
It seems to work if I remove the underscores during search…
Also notice how the field is indexed by the scraper… it has a space before and after the underscore in
hierarchy.lvl3
Somehow the generated markup seems to have this issue…
So long story short, there is actually no exact match for
TEAR_FALLING_ACCELERATION
in the index collection
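A rough illustration of why the exact-match lookup fails, assuming the tokenizer treats `_` as a separator by default (the real behavior depends on the collection's `symbols_to_index` setting; this `tokenize` helper is a made-up sketch, not Typesense's actual tokenizer):

```python
import re

def tokenize(text, symbols_to_index=()):
    # Split on any character that is neither alphanumeric nor an explicitly
    # indexed symbol, roughly mimicking default search-engine tokenization.
    keep = "".join(re.escape(s) for s in symbols_to_index)
    return [t for t in re.split(rf"[^0-9A-Za-z{keep}]+", text) if t]

# By default the underscore splits the identifier apart, so the index never
# contains the exact token "TEAR_FALLING_ACCELERATION".
print(tokenize("TEAR_FALLING_ACCELERATION"))        # → ['TEAR', 'FALLING', 'ACCELERATION']
print(tokenize("TEAR_FALLING_ACCELERATION", ["_"]))  # → ['TEAR_FALLING_ACCELERATION']
```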
j
Oh, that's interesting. I wonder why Algolia is able to index it properly though.
j
Could you try adding
split_join_tokens: true
to
themeConfig.typesense.typesenseSearchParameters
to see if that helps?
Correction:
split_join_tokens: always
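For reference, `split_join_tokens` is a regular Typesense search parameter, so the same setting can be tested directly against the collection. A sketch of the parameter dict as it would be passed to the Python client's `documents.search()` (the query and `query_by` fields here are illustrative, not taken from the actual Docusaurus config):

```python
# Hypothetical Typesense search parameters; with the Python client this dict
# would be passed to client.collections["<collection>"].documents.search(...).
search_parameters = {
    "q": "TEAR_FALLING_ACCELERATION",
    "query_by": "hierarchy.lvl3",
    # "always" asks Typesense to also try splitting and joining query tokens,
    # so an underscore-joined query can still match the split-up index tokens.
    "split_join_tokens": "always",
}
print(search_parameters["split_join_tokens"])  # → always
```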
j
Thanks Jason. I did that, and it didn't seem to change anything. I still have 0 results for "TEAR_FALLING_ACCELERATION".
j
Another thing to try is to set
symbols_to_index
in the collection schema, and set it to
["_"]
The scraper currently uses a fixed schema, so you would have to fork the scraper and change the schema here: https://github.com/typesense/typesense-docsearch-scraper/blob/a005d7a8bbd45bd71fd3895024f05663e9f797c6/scraper/src/typesense_helper.py#L36-L53
So it should be something like:
'name': self.collection_name_tmp,
'symbols_to_index': '_',
'fields': [...]
...
j
I filed a ticket with Docusaurus, and they say that this is Typesense's fault: https://github.com/facebook/docusaurus/issues/8645#issuecomment-1423386178 I'll work on forking the plugin now, thank you.
🤔 1
I consider myself an expert at Linux systems, but I wasn't able to follow the instructions here: https://github.com/typesense/typesense-docsearch-scraper#releasing-a-new-version From a fresh Ubuntu 22 server, I ran into several roadblocks, the last of which is dotenv-related, which I presume should be handled automatically by the pipenv environment. Can we update the README file with more detailed instructions of exactly what to install/run from a fresh Ubuntu 22, step by step? I understand not wanting to add detailed documentation for every Linux distribution, but I feel like, at the very least, it should be buildable on Ubuntu.
j
Hmm, I follow those exact instructions any time we need to publish a new version of the scraper. I do run it from macOS though. May I know what errors you ran into?
j
• The first error was running pipenv after installing it from apt, which resulted in a bunch of errors.
• After some googling, I saw it was recommended to install pipenv from PyPI instead, so I removed the apt package and tried that:
pip install --user pipenv
• That worked, but then I got an error about Python 3.6 not being installed.
• In order to install Python 3.6, I determined that I needed to install pyenv, so I ran:
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
And then:
curl https://pyenv.run | bash
Then I was able to do:
pyenv install 3.6.3
But then I got an error when creating the env:
james@2wsx:~/typesense-docsearch-scraper$ pipenv shell
Creating a virtualenv for this project...
Pipfile: /home/james/typesense-docsearch-scraper/Pipfile
Using /home/james/.pyenv/versions/3.6.3/bin/python3.6m (3.6.3) to create virtualenv...
⠧ Creating virtual environment...fail
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/virtualenv/seed/embed/via_app_data/via_app_data.py", line 84, in _get
    result = get_wheel(
  File "/usr/lib/python3/dist-packages/virtualenv/seed/wheels/acquire.py", line 26, in get_wheel
    wheel = from_bundle(distribution, version, for_py_version, search_dirs, app_data, do_periodic_update, env)
  File "/usr/lib/python3/dist-packages/virtualenv/seed/wheels/bundle.py", line 13, in from_bundle
    wheel = load_embed_wheel(app_data, distribution, for_py_version, of_version)
  File "/usr/lib/python3/dist-packages/virtualenv/seed/wheels/bundle.py", line 33, in load_embed_wheel
    wheel = get_embed_wheel(distribution, for_py_version)
  File "/usr/lib/python3/dist-packages/virtualenv/seed/wheels/embed/__init__.py", line 77, in get_embed_wheel
    raise Exception((
Exception: Wheel for pip for Python 3.6 is unavailable. apt install python3-pip-whl
created virtual environment CPython3.6.3.final.0-64 in 1417ms
  creator CPython3Posix(dest=/home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/james/.local/share/virtualenv)
    added seed packages: pip==21.3.1, setuptools==59.6.0, wheel==0.37.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

✔ Successfully created virtual environment!
Virtualenv location: /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y
Launching subshell in virtual environment...
 . /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/activate
I proceeded anyway, and crossed my fingers, but the next command also fails:
james@2wsx:~/typesense-docsearch-scraper$ pipenv shell
Launching subshell in virtual environment...
 . /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/activate
james@2wsx:~/typesense-docsearch-scraper$  . /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/activate
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$ ./docsearch docker:build
Traceback (most recent call last):
  File "./docsearch", line 3, in <module>
    from cli.src.index import run
  File "/home/james/typesense-docsearch-scraper/cli/src/index.py", line 3, in <module>
    from dotenv import load_dotenv
ModuleNotFoundError: No module named 'dotenv'
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$ pip install dotenv
Collecting dotenv
  Downloading dotenv-0.0.5.tar.gz (2.4 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status -11:
   command: /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-apjacqc9/dotenv_ddf8c6da6ef9463bb8285aad152279ed/setup.py'"'"'; __file__='"'"'/tmp/pip-install-apjacqc9/dotenv_ddf8c6da6ef9463bb8285aad152279ed/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-lxvlxqvp
       cwd: /tmp/pip-install-apjacqc9/dotenv_ddf8c6da6ef9463bb8285aad152279ed/
  Complete output (0 lines):
  ----------------------------------------
WARNING: Discarding <https://files.pythonhosted.org/packages/e2/46/3754073706e31670eed18bfa8a879305b56a471db15f20523c2427b10078/dotenv-0.0.5.tar.gz#sha256=b58d2ab3f83dbd4f8a362b21158a606bee87317a9444485566b3c8f0af847091> (from <https://pypi.org/simple/dotenv/>). Command errored out with exit status -11: python setup.py egg_info Check the logs for full command output.
j
Could you run
pip install
? And then try again?
j
Inside the pipenv shell, or inside the normal shell?
j
normal shell
j
pip isn't on Ubuntu by default, because it wants you to use either pip2 or pip3.
j
Oh wait, it’s
pipenv install
j
Hmm, this is unfortunately beyond my Python tooling knowledge, so I'm not sure how much more I can help.
I vaguely remember running into a bunch of issues with pipenv (different ones than the ones you shared above) and finally stumbled my way around to getting it to work on my local machine.
So it seems like there are many confounding variables on what could be going wrong with the python > pip > pipenv environment
j
Jason, I tried on macOS and ran into the exact same issue. Can you update the README file with instructions for how to do it from a fresh Mac or a fresh Linux install with nothing else installed?
j
I recently upgraded my mac, so this is an almost brand new OS installation as far as docsearch-scraper is concerned, but I did already have pyenv installed. This is what I did and it worked for me:
brew install pyenv
pyenv install 3.6
pyenv local 3.6
pip install --upgrade pip
pip install --user pipenv
pipenv install
pipenv shell
j
I must insist: I believe the errors are caused by a bogus Pipfile.lock. If you remove the Pipfile.lock and then do
pipenv install
, you will get a bunch of errors about dependencies that cannot be resolved, so this definitely seems like a problem with the repository itself.
Looks like the relevant error message is:
[pipenv.exceptions.ResolutionFailure]: Warning: Your dependencies could not be resolved. You likely have a mismatch in your sub-dependencies.
  You can use $ pipenv install --skip-lock to bypass this mechanism, then run $ pipenv graph to inspect the situation.
  Hint: try $ pipenv lock --pre if it is a pre-release dependency.
ERROR: No matching distribution found for slacker==0.9.60
j
That version exists though: https://pypi.org/project/slacker/0.9.60/
And it installed on my machine
Could you make sure you’re running python 3.6 and not the default 2.7 that comes installed on macOS?
j
I switched back to Ubuntu:
james@2wsx:~/typesense-docsearch-scraper$ python --version
Python 3.6.15
You can follow these exact instructions step by step, which should be able to reproduce the problem.
j
I was able to replicate this on a brand new Ubuntu 22 machine
Researching why this is happening
j
It happens on macOS too, for what it is worth.
j
I just spent a couple of hours on this, and I somehow got it to work, but I don’t know which sequence of steps made it work 😢
j
Well, we need to find out those steps and edit my PR accordingly.
j
Going to copy my bash history, and then try again on a new machine
j
When I was testing, I found it useful to use snapshots in VirtualBox.
j
No virtualbox on M1 sadly
j
For example, I did a snapshot after a fresh install, and then I did another snapshot after the
pyenv local 3.6
command, since that takes a particularly long time.
Back when I used macOS for work I used VMWare Fusion a lot.
Is that updated for M1?
j
Haven’t used VMWare Fusion… But I heard Virtual Box is planning to support M1, with no ETA
I guess Parallels is another option, although I think Fusion is much better.
j
I used to use Parallels about 10 years ago, the UX was pretty awesome
j
Looks like Parallels also supports M1.
j
Good to know!
Ok, here you go. It was a Python version issue: if you upgrade to Python 3.9, it works.
sudo apt update && sudo apt install build-essential curl libbz2-dev libffi-dev liblzma-dev libncursesw5-dev libreadline-dev libsqlite3-dev libssl-dev libxml2-dev libxmlsec1-dev llvm make tk-dev wget xz-utils zlib1g-dev --yes
curl https://pyenv.run | bash
echo >> ~/.bashrc
echo '# Adding pyenv' >> ~/.bashrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc
pyenv install 3.9
pyenv local 3.9
pip install --upgrade pip
echo >> ~/.bashrc
echo '# Fixing pipx warning' >> ~/.bashrc
echo 'PATH=$PATH:~/.local/bin' >> ~/.bashrc
source ~/.bashrc
pip install --user pipenv
git clone https://github.com/typesense/typesense-docsearch-scraper.git
cd typesense-docsearch-scraper/
pipenv install
vim Pipfile # <==== edit the Python version to 3.9 in the Pipfile
pipenv install
pipenv shell
j
It doesn't work for me:
james@2wsx:~/typesense-docsearch-scraper$ pipenv install
Pipfile.lock (ba301c) out of date, updating to (402916)...
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✘ Locking Failed!
⠹ Locking...
Traceback (most recent call last):
  File "/home/james/.local/lib/python3.9/site-packages/pipenv/resolver.py", line 845, in <module>
    main()
  File "/home/james/.local/lib/python3.9/site-packages/pipenv/resolver.py", line 819, in main
    _ensure_modules()
  File "/home/james/.local/lib/python3.9/site-packages/pipenv/resolver.py", line 16, in _ensure_modules
    spec.loader.exec_module(pipenv)
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/james/.local/lib/python3.9/site-packages/pipenv/__init__.py", line 63, in <module>
    from .cli import cli
  File "/home/james/.local/lib/python3.9/site-packages/pipenv/cli/__init__.py", line 1, in <module>
    from .command import cli  # noqa
  File "/home/james/.local/lib/python3.9/site-packages/pipenv/cli/command.py", line 4, in <module>
    from pipenv import environments
  File "/home/james/.local/lib/python3.9/site-packages/pipenv/environments.py", line 10, in <module>
    from pipenv.patched.pip._vendor.platformdirs import user_cache_dir
  File "/home/james/.local/lib/python3.9/site-packages/pipenv/patched/pip/_vendor/platformdirs/__init__.py", line 5
    from __future__ import annotations
    ^
SyntaxError: future feature annotations is not defined
This is on a fresh Ubuntu 22, following these exact instructions.
j
There’s a typo in my steps. I have a
pipenv install
before editing the python version in Pipfile
Could you run
pipenv install
after editing python version?
j
I've already edited the Python version in the Pipfile. I still get the error listed above.
j
Let’s try
pipenv --rm
pipenv --python 3.9
pipenv lock --clear
pipenv install
Let me edit that, hang on
ok done editing
j
That worked. Can you edit my PR here to add the missing steps? https://github.com/typesense/typesense-docsearch-scraper/pull/23/files It's unclear from my side what those commands are doing.
🎉 1
j
Sure will do
j
Now, I'm getting a new error:
james@2wsx:~/typesense-docsearch-scraper$ pipenv shell
Loading .env environment variables...
Loading .env environment variables...
Launching subshell in virtual environment...
 . /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/activate
james@2wsx:~/typesense-docsearch-scraper$  . /home/james/.local/share/virtualenvs/typesense-docsearch-scraper-fJqFon_Y/bin/activate
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$ ./docsearch docker:build
/home/james/typesense-docsearch-scraper/cli/src/commands/run_tests.py:22: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if args[1] is "no_browser":
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "<http://%2Fvar%2Frun%2Fdocker.sock/_ping>": dial unix /var/run/docker.sock: connect: permission denied
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
Again, this is a fresh Ubuntu VM, and if it matters, I installed Docker by following the normal steps in the official Docker documentation here: https://docs.docker.com/engine/install/ubuntu/
j
Docker needs to be run as root usually on Linux
Also I’m not sure if the docker daemon autostarts after install
j
Won't running as root mess up all of the Python virtual environment stuff that we have been carefully setting up over the past few hours?
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$ sudo ./docsearch docker:build
[sudo] password for james:
/usr/bin/env: 'python': No such file or directory
j
I meant the docker daemon
Could you check if it’s running
Does
sudo docker run hello-world
work?
j
Yes:
james@2wsx:~/typesense-docsearch-scraper$ sudo docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
2db29710123e: Pull complete
Digest: sha256:aa0cc8055b82dc2509bed2e19b275c8f463506616377219d9642221ab53cf9fe
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 <https://hub.docker.com/>

For more examples and ideas, visit:
 <https://docs.docker.com/get-started/>
Hmm
Although in your case the hello world container works
j
Well I'm running the hello world container as sudo.
j
Oh right
j
I will try following this guide.
👍 1
Ok, that worked. I updated the pull request again.
Now I'm getting a new error after running `./docsearch docker:build`:
=> ERROR [13/26] RUN apt-get update -y && apt-get install -yq   google-chrome-stable=99.0.4844.51-1   unzip                                                            2.5s
------
 > [13/26] RUN apt-get update -y && apt-get install -yq   google-chrome-stable=99.0.4844.51-1   unzip:
#0 0.362 Get:1 <http://dl.google.com/linux/chrome/deb> stable InRelease [1811 B]
#0 0.444 Hit:2 <https://deb.nodesource.com/node_8.x> bionic InRelease
#0 0.450 Hit:3 <http://security.ubuntu.com/ubuntu> bionic-security InRelease
#0 0.462 Hit:4 <http://archive.ubuntu.com/ubuntu> bionic InRelease
#0 0.481 Get:5 <http://dl.google.com/linux/chrome/deb> stable/main amd64 Packages [1061 B]
#0 0.552 Hit:6 <http://archive.ubuntu.com/ubuntu> bionic-updates InRelease
#0 0.565 Hit:7 <http://ppa.launchpad.net/openjdk-r/ppa/ubuntu> bionic InRelease
#0 0.639 Hit:8 <http://archive.ubuntu.com/ubuntu> bionic-backports InRelease
#0 0.732 Fetched 2872 B in 0s (6409 B/s)
#0 0.732 Reading package lists...
#0 1.713 Reading package lists...
#0 2.353 Building dependency tree...
#0 2.482 Reading state information...
#0 2.495 E: Version '99.0.4844.51-1' for 'google-chrome-stable' was not found
------
Dockerfile.base:37
--------------------
  36 |     RUN echo "deb [arch=amd64]  <http://dl.google.com/linux/chrome/deb/> stable main" >> /etc/apt/sources.list.d/google-chrome.list
  37 | >>> RUN apt-get update -y && apt-get install -yq \
  38 | >>>   google-chrome-stable=99.0.4844.51-1 \
  39 | >>>   unzip
  40 |     RUN wget -q <https://chromedriver.storage.googleapis.com/99.0.4844.51/chromedriver_linux64.zip>
--------------------
ERROR: failed to solve: process "/bin/sh -c apt-get update -y && apt-get install -yq   google-chrome-stable=99.0.4844.51-1   unzip" did not complete successfully: exit code: 100
(typesense-docsearch-scraper) james@2wsx:~/typesense-docsearch-scraper$
j
I just remembered another PR that fell off my radar… I think this actually will fix a lot of these issues: https://github.com/typesense/typesense-docsearch-scraper/pull/16/files
Could you check out that branch and try running build from there?
If that works, I can merge that PR in
j
I tried it, and I get the same error relating to Google Chrome.
j
Could you try changing it to the latest version of Chrome?
j
That worked. Do you want me to add that to my PR?
j
Yeah, that would be great
j
Ok, when running the crawler, I get a new error:
2023-02-10 19:38:18 [scrapy.core.scraper] ERROR: Spider error processing <GET <https://isaacscript.github.io/>> (referer: None)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 180, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 65, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 67, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
2023-02-10 19:38:19 [scrapy.core.scraper] ERROR: Spider error processing <GET <https://isaacscript.github.io/isaac-typescript-definitions/enums/CollectibleSpriteLayer/>> (referer: <https://isaacscript.github.io/sitemap.xml>)
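The traceback suggests the scraper's `add_records` double-decodes each import result with `json.loads(json.loads(r))`, which fails when the client already returns plain dicts. A version-tolerant filter might look like this (a sketch under that assumption, not the scraper's actual code; `find_failed` is a made-up name):

```python
import json

def find_failed(results):
    """Return import results whose 'success' flag is False.

    Older typesense-python versions returned each result as a (sometimes
    doubly) JSON-encoded string; newer ones return dicts directly, so we
    unwrap string layers until we reach a dict.
    """
    failed = []
    for r in results:
        while isinstance(r, str):  # unwrap however many JSON layers exist
            r = json.loads(r)
        if r.get("success") is False:
            failed.append(r)
    return failed

print(find_failed([{"success": True}, '"{\\"success\\":false}"']))
# → [{'success': False}]
```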
j
Could you share the changes you made to that file?
j
To what file?
j
src/typesense_helper.py
Or you didn’t make any changes?
j
Well, you told me to add this:
'symbols_to_index': '_',
I did that, and then I got an error complaining that "symbols_to_index" needed to be an array, so I assume that you just made a typo. Then, I updated it to be this:
'symbols_to_index': ['_'],
That got me past the error, and it started crawling my website, but then I got the error that I pasted above.
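Put together, the edited schema in the forked scraper's `typesense_helper.py` would then look something like this sketch (the collection name and field list are illustrative; only the `symbols_to_index` line differs from the stock scraper, and it must be a list of strings):

```python
# Hypothetical excerpt of the collection schema the forked scraper creates.
schema = {
    "name": "isaacscript_tmp",     # stands in for self.collection_name_tmp
    "symbols_to_index": ["_"],     # list of strings, not a bare "_" string
    "fields": [
        {"name": "hierarchy.lvl3", "type": "string", "optional": True},
        # ... remaining fields unchanged from the stock schema ...
    ],
}
print(schema["symbols_to_index"])  # → ['_']
```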
j
Could you add
print(result)
right before this line and share that output?
j
[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
j
Is that just before it errors out?
j
Well, the first one is
[{'success': True}]
But it generates the longer one after each error.
e.g.
james@2wsx:~/crawler$ ./run.sh
[{'success': True}]
2023-02-10 20:48:58 [scrapy.core.scraper] ERROR: Spider error processing <GET <https://isaacscript.github.io/>> (referer: None)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 180, in parse_from_start_url
    self.add_records(response, from_sitemap=False)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:48:58 [scrapy.core.scraper] ERROR: Spider error processing <GET <https://isaacscript.github.io/isaac-typescript-definitions/enums/ConstantStoneShooterVariant/>> (referer: <https://isaacscript.github.io/sitemap.xml>)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 172, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:48:59 [scrapy.core.scraper] ERROR: Spider error processing <GET <https://isaacscript.github.io/isaac-typescript-definitions/enums/CollectibleAnimation/>> (referer: <https://isaacscript.github.io/sitemap.xml>)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 172, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
^C[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:49:00 [scrapy.core.scraper] ERROR: Spider error processing <GET <https://isaacscript.github.io/isaac-typescript-definitions/enums/CollectibleType/>> (referer: <https://isaacscript.github.io/sitemap.xml>)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 172, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
^C[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:49:00 [scrapy.core.scraper] ERROR: Spider error processing <GET <https://isaacscript.github.io/isaac-typescript-definitions/enums/ConstantStoneShooterSubType/>> (referer: <https://isaacscript.github.io/sitemap.xml>)
Traceback (most recent call last):
  File "/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/seleuser/src/documentation_spider.py", line 172, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/home/seleuser/src/documentation_spider.py", line 152, in add_records
    self.typesense_helper.add_records(records, response.url, from_sitemap)
  File "/home/seleuser/src/typesense_helper.py", line 66, in add_records
    failed_items = list(
  File "/home/seleuser/src/typesense_helper.py", line 68, in <lambda>
    filter(lambda r: json.loads(json.loads(r))['success'] is False, result)))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict
^C^C[{'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}, {'success': True}]
2023-02-10 20:49:01 [scrapy.core.scraper] ERROR: Spider error processing <GET <https://isaacscript.github.io/isaac-typescript-definitions/enums/CollectibleSpriteLayer/>> (referer: <https://isaacscript.github.io/sitemap.xml>)
j
Looks like the typesense-python version in the lockfile is an older version with slightly different behavior, which is what's causing this issue.
Could you lock the typesense package to this version in
Pipfile
Copy code
typesense = "==0.10.0"
Then run
pipenv install
and run the scraper again
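For context on why the version pin fixes the crash: the scraper decodes each import result twice with `json.loads`, which only works when the client hands back JSON-encoded strings. A minimal sketch of the mismatch (the values are illustrative, not taken from the scraper):

```python
import json

# Older typesense-python (e.g. 0.10.0) returned each import result as a
# JSON-encoded string, which survives the scraper's double decode:
old_style = '"{\\"success\\":true}"'
print(json.loads(json.loads(old_style)))  # → {'success': True}

# Newer client versions return already-parsed dicts (as seen in the
# [{'success': True}, ...] output above), so the same double decode
# raises the TypeError from the traceback:
try:
    json.loads({"success": True})
except TypeError as err:
    print(err)  # the JSON object must be str, bytes or bytearray, not dict
```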
j
It seems to be scraping now without any errors, thanks. Do you want me to add that to the PR?
🎉 1
j
Yeah, that would be great
j
Is that what you’re using now to run the scraper?
j
Indeed, as it was the only way to build it, as we discussed yesterday.
j
Got it, in that case, I’ll make the change to the typesense version in the Pipfile in that PR
j
Also, I noticed that you might want to close this one, as the author seems MIA: https://github.com/typesense/typesense-docsearch-scraper/pull/3
When the crawler finishes, is it supposed to return me to my shell? It seems to be hanging.
j
It should, yeah
What are the last, say, 20 lines?
j
Copy code
DEBUG:typesense.api_call:Making post /collections/isaacscript_1676064791/documents/import
DEBUG:typesense.api_call:Try 1 to node isaacracing.net:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): isaacracing.net:8108
DEBUG:urllib3.connectionpool:https://isaacracing.net:8108 "POST /collections/isaacscript_1676064791/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:isaacracing.net:8108 is healthy. Status code: 200
['"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"']
DEBUG:typesense.api_call:Making post /collections/isaacscript_1676064791/documents/import
DEBUG:typesense.api_call:Try 1 to node isaacracing.net:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): isaacracing.net:8108
DEBUG:urllib3.connectionpool:https://isaacracing.net:8108 "POST /collections/isaacscript_1676064791/documents/import HTTP/1.1" 200 None
DEBUG:typesense.api_call:isaacracing.net:8108 is healthy. Status code: 200
['"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"', '"{\\"success\\":true}"']
If I had to guess, it just finished going through all the URLs, but maybe it's choking on some final processing, or something.
j
Hmm, I’ve never seen it hang like this… Maybe try restarting?
Btw, you had to make a Chrome version upgrade, right? Since you’re updating the code anyway, could you also add the typesense pinning to the Pipfile now?
j
I already did.
j
I’ve merged in the PR that you’re using
After merging that, there’s now a conflict in your PR. Could you resolve that?
j
Just did.
👍 1
Ok, I ran it again, and the second time it completed successfully.
However, when I search for "TEAR_FALLING_ACCELERATION" on my website, it still shows up as 0 results, so it looks like the modification didn't accomplish anything.
j
Hmm, ok let’s try one more thing. Instead of
symbols_to_index: ['_']
, could you change that to
token_separators: ['_']
and then rerun the scraper?
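For reference, a sketch of the two schema variants being discussed (the collection name and field list are placeholders, not the scraper's actual schema from typesense_helper.py):

```python
# Two ways to make underscore-heavy identifiers searchable.
# All names below are illustrative placeholders.
collection_name = "docs_example"

# symbols_to_index: keep "_" inside tokens, so the single exact token
# "TEAR_FALLING_ACCELERATION" is what gets indexed.
schema_with_symbols = {
    "name": collection_name,
    "symbols_to_index": ["_"],
    "fields": [{"name": ".*", "type": "auto"}],  # placeholder field list
}

# token_separators: split on "_", so the identifier is indexed as the
# separate tokens "tear", "falling", "acceleration" and matches queries
# with or without underscores.
schema_with_separators = {
    "name": collection_name,
    "token_separators": ["_"],
    "fields": [{"name": ".*", "type": "auto"}],
}
```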
j
Actually, I realized something. As a hack, I tried running Prettier on the Markdown, which seems to remove the escape characters automatically, so I'll add that to my build pipeline, and see if it fixes the underlying problem.
Yeah, it looks like using Prettier fixes things.
j
Yaaay!
j
Some of my webpages are not showing up in the search. For example, this page: https://isaacscript.github.io/isaacscript-common/other/enums/ModCallbackCustom I suspect that it is because there are too many elements on the page. In Algolia land, there is a really nice GUI that tells you the specific pages that had 404s or otherwise had errors. Is there a way to get that kind of functionality from the Typesense crawler? Are there any blogs you can point me towards that explain how to start troubleshooting this kind of thing?
j
You want to look at the scraper logs… Looks like you already have DEBUG logs turned on, based on what you shared earlier
Search for
Too much hits, DocSearch only handle
If you search for that in the scraper codebase, you can change that value
j
I did save the output, but I don't get any results for "error", and all the entries relating to "ModCallbackCustom" look to be normal messages indicating that the page was properly ingested. I don't have any results for "Too much hits". Anything more specific that I should be looking for?
I can throw the output on pastebin if that is helpful.
j
Hmm, then maybe the issue is something else
If the issue is about too many elements on the page, that’s the error you’ll see, per the code: https://github.com/typesense/typesense-docsearch-scraper/blob/37334bbcea17df8eedeeb82200815a4fe8e02759/scraper/src/documentation_spider.py#L154
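To make that failure mode concrete, here is a hedged paraphrase of what such a guard does (this is not the scraper's actual code, and the cap value is an illustrative assumption, so check the linked source for the real number):

```python
# Illustrative paraphrase: pages that produce more records than a fixed
# cap are skipped with the "Too much hits" warning, which is why such
# pages silently vanish from search results.
MAX_RECORDS_PER_PAGE = 750  # assumption: placeholder cap, not the real value


def should_index(records: list) -> bool:
    """Return True if this page's records fit under the per-page cap."""
    if len(records) > MAX_RECORDS_PER_PAGE:
        print(
            "Too much hits, DocSearch only handle "
            f"{MAX_RECORDS_PER_PAGE} records per page"
        )
        return False
    return True
```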
Ah, found the issue
In your scraper config ^
j
That worked, great.
Should this be documented somewhere? I feel like the Typesense docs should tell you to do that.
j
It’s very Docusaurus-specific… Let me see if there’s a better Docusaurus config that we can just link to
Looks like Docusaurus’s doc site has since moved to Algolia’s proprietary crawler for their search
So yeah, it would be good to call this out as one of the bullet points you added
j
Do you want me to do another PR?
j
Yeah, that’s on a different repo
j
Aye, I already did a PR to typesense-website yesterday.
j
I already merged that in
j
Right, I was asking if you wanted me to do another PR to
typesense-website
, or if you wanted to take care of it.
j
Ah, it would be great if you can do another PR
j
Ok, I'm starting on it now.
I have a question about pricing. If I pay for a cloud-hosted Typesense cluster from you guys, would you also run the indexing in the cloud automatically?
j
No, we only host the Typesense Cloud cluster. The scraper is something you’d host on your side (typically in your CI pipeline, you’d trigger the scraper to run post deploy of your docs site)
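As a sketch of that post-deploy setup (the config file name and env file contents are assumptions for this example, not a definitive invocation), the scraper's Docker image can be triggered from a CI step after the docs site deploys:

```shell
# Run the Typesense docsearch scraper after a successful docs deploy.
# .env is assumed to hold the Typesense connection settings
# (host, port, protocol, API key); the config file name is illustrative.
docker run -it --env-file=.env \
  -e "CONFIG=$(cat docusaurus-config.json | jq -r tostring)" \
  typesense/docsearch-scraper
```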
j
Oh, that makes perfect sense, thank you.
👍 1
I'll add that to the PR as well.
j
I think there’s a tip section in there already that might mention this… about triggering it from your CI pipeline