#community-help

Solving Typesense Docsearch Scraper Issues

TLDR Sandeep was having issues with Typesense's docsearch scraper and getting fewer results than with Algolia's scraper. Jason helped by sharing the query they use and advised checking the running version of the scraper. The issue was resolved when Sandeep ran the non-base regular docker image.

Jan 06, 2022
Sandeep
07:38 PM
Hey everyone, is anyone using docsearch with Typesense? When I tried, the results I got were really bad compared to Algolia's docsearch scraper. I also noticed the documentation website's search wasn't the best, so I'm wondering if this is just a limitation of the scraper or if I did something wrong. I'm using the exact same config with both, typesense cloud with the latest version available there. For context, algolia's scraper returned 1.3k hits, typesense only returned 500, and it seemed like none of the text was indexed, only some headers.
Jason
07:40 PM
Hey Sandeep! We are using the docsearch scraper for the typesense docs website search.

The number of results returned could differ based on the index settings in Algolia and Typesense (e.g., typo tolerance sensitivity, drop tokens). Is that the primary metric you're going off of?

If not, could you share some examples where the results surfaced on top are not expected in Typesense? For the same query, could you share what Algolia returns for the same dataset if possible?
Sandeep
07:51 PM
ok let me try to come up with some examples
Sandeep
10:24 PM
[video attachment]
Jason
10:26 PM
Thanks Sandeep! One quick thing I noticed in the first minute of the video is that the query_by fields in the Typesense Cloud dashboard are set to "anchor, content" (right below the search results)
Jason
10:26 PM
Let me get you the exact query_by I use in typesense-docsearch.js
Jason
10:30 PM
This is the equivalent of configuring "Searchable Attributes" in Algolia...
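
The snippet Jason posted with this message is not preserved in the log. As a rough sketch of what setting query_by looks like, the following builds a request URL for Typesense's document search endpoint. The field list (hierarchy.lvl0 through hierarchy.lvl6 plus content) is an assumption based on the fields the docsearch scraper typically writes, not a copy of Jason's message. The helper name build_search_url, the host, and the collection name are hypothetical.

```python
from urllib.parse import urlencode

# Assumed field list: the heading levels and body text that the docsearch
# scraper typically writes. Adjust it to match your collection's schema.
QUERY_BY = ",".join([f"hierarchy.lvl{i}" for i in range(7)] + ["content"])


def build_search_url(base_url: str, collection: str, query: str) -> str:
    """Build a Typesense search URL that searches headings and body text."""
    params = urlencode({"q": query, "query_by": QUERY_BY})
    return f"{base_url}/collections/{collection}/documents/search?{params}"


if __name__ == "__main__":
    # Hypothetical host and collection name. Send the request with an
    # X-TYPESENSE-API-KEY header that holds a search-only key.
    print(build_search_url("http://localhost:8108", "docs", "install"))
```

Searching only "anchor, content", as in the dashboard view Jason noticed, would skip the heading fields entirely, which is one plausible reason results can look off even when the index itself is complete.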
Sandeep
10:31 PM
ok that helps a bit, but the results are still "wrong"

Basically, it seems none of the actual articles are being scanned, only the folders.
Jason
10:31 PM
Do the total number of docs in the index match up between Algolia and Typesense?
Sandeep
10:32 PM
is there a way to see what pages were indexed? algolia's scraper returned all the urls that were scanned along with the hits per page, but the typesense one just has Scrapy debug info
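
One way to approximate a hits-per-page report is to export the collection (Typesense's GET /collections/<name>/documents/export endpoint returns one JSON document per line) and count records per page. This is a minimal sketch, not something from the thread. It assumes the scraper stores the page address in a url_without_anchor field, falling back to url, and the helper name hits_per_page is hypothetical.

```python
import json
from collections import Counter


def hits_per_page(exported_jsonl: str) -> Counter:
    """Count indexed records per page from a JSONL export of the collection."""
    counts = Counter()
    for line in exported_jsonl.splitlines():
        if not line.strip():
            continue  # skip blank lines in the export
        doc = json.loads(line)
        # Assumed field names; fall back to the full URL minus its fragment.
        page = doc.get("url_without_anchor") or doc.get("url", "").split("#")[0]
        counts[page] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample records standing in for a real export.
    sample = "\n".join([
        json.dumps({"url": "https://example.com/a#intro",
                    "url_without_anchor": "https://example.com/a"}),
        json.dumps({"url": "https://example.com/a#setup",
                    "url_without_anchor": "https://example.com/a"}),
        json.dumps({"url": "https://example.com/b#top"}),
    ])
    for page, count in hits_per_page(sample).most_common():
        print(count, page)
```

Pages that appear in the sitemap but not in this count are the ones the scraper skipped or failed to parse.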
Sandeep
10:32 PM
no- 1.3k in algolia, 500 in typesense
Jason
10:32 PM
That's not just the search results for a particular query right? You're talking about the total number of documents in the collection?
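
To check the collection total rather than a per-query result count, the collection metadata can be read directly. The sketch below assumes the index name is an alias (the scraper's log output later in this thread shows a PUT to /aliases/...), resolves it with GET /aliases/<name>, then reads num_documents from GET /collections/<name>. The helper names and the injected fetch callable are hypothetical conveniences.

```python
import json
from urllib.request import Request, urlopen


def get_json(base_url: str, path: str, api_key: str) -> dict:
    """GET a Typesense API path and decode the JSON response."""
    req = Request(f"{base_url}{path}", headers={"X-TYPESENSE-API-KEY": api_key})
    with urlopen(req) as resp:
        return json.load(resp)


def total_documents(alias: str, fetch) -> int:
    """Resolve an alias to its collection and return that collection's size."""
    collection = fetch(f"/aliases/{alias}")["collection_name"]
    return fetch(f"/collections/{collection}")["num_documents"]


# Example wiring (hypothetical host and key; not executed here):
#   fetch = lambda path: get_json("http://localhost:8108", path, "<api-key>")
#   print(total_documents("<index_name from your config>", fetch))
```

Comparing this number with the nb_hits value the scraper prints at the end of a crawl shows whether documents were lost during the crawl or were never created.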
Sandeep
10:40 PM
yea
Jason
10:50 PM
Hmmm, that seems way off... and you're using docsearch-scraper for Algolia and not their new proprietary Algolia Crawler product, right?
Jan 07, 2022
Sandeep
12:38 AM
correct, the legacy one. The new one def won't work for us haha, we modify cookies to do auth.
Jason
12:56 AM
Sandeep Could you share your docsearch configs with me, and can I run it against your docs site locally to see what's going on?
Sandeep
02:10 AM
{
  "stop_urls": [],
  "selectors": {
    "lvl0": ".DocSearch-content h1",
    "lvl1": ".DocSearch-content h2",
    "lvl2": ".DocSearch-content h3",
    "lvl3": ".DocSearch-content h4",
    "lvl4": ".DocSearch-content h5",
    "lvl5": ".DocSearch-content h6",
    "text": ".DocSearch-content div, .DocSearch-content li"
  },
  "selectors_exclude": [".codeBlock"],
  "nb_hits": "OUTPUT OF THE CRAWL",
  "index_name": "Kbeectam3rSwqrnaXhLhyqYy",
  "start_urls": ["https://help.kbee.app"],
  "sitemap_urls": ["https://help.kbee.app/sitemap.xml"]
}
Sandeep
02:11 AM
the search on help.kbee.app is currently using Algolia for comparison
Jason
02:37 AM
Sandeep When I run typesense-docsearch-scraper locally I do get 1317 documents created:

DEBUG:urllib3.connectionpool: "PUT /aliases/Kbeectam3rSwqrnaXhLhyqYy HTTP/1.1" 200 None
DEBUG:typesense.api_call:localhost:8108 is healthy. Status code: 200
Nb hits: 1317
previous nb_hits: None

Do you want to update the nb_hits in configs/private/kbee.json ? [y/n]:
y

[OK] configs/private/kbee.json has been updated
Jason
02:38 AM
Could you make sure you're running the latest version of the scraper?
Sandeep
02:44 AM
ok that is really good news!

So what I'm doing is using typesense/docsearch-base, and then I have a little Node server that calls the Python script directly. The script had to be modified (basically, I modify the request cookies in the custom_downloader_middleware.py file), so I pulled the latest code from the GitHub repo's master branch.
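
Sandeep's actual change is not shown in the thread. Purely as an illustration of the general idea, here is a Scrapy-style downloader middleware that attaches auth cookies to each outgoing request. It assumes Scrapy's standard process_request hook and that Scrapy's built-in cookie handling is enabled and runs after this middleware. The class name and the AUTH_COOKIES setting are hypothetical.

```python
class AuthCookieMiddleware:
    """Hypothetical Scrapy-style downloader middleware that adds auth cookies."""

    def __init__(self, cookies: dict):
        self.cookies = dict(cookies)

    @classmethod
    def from_crawler(cls, crawler):
        # AUTH_COOKIES is a made-up setting name used only for this sketch.
        return cls(crawler.settings.getdict("AUTH_COOKIES"))

    def process_request(self, request, spider):
        existing = request.cookies
        if isinstance(existing, list):
            # Scrapy also accepts a list of {"name": ..., "value": ...} dicts.
            existing = {c["name"]: c["value"] for c in existing}
        # Merge, letting the auth cookies win on a name clash.
        request.cookies = {**(existing or {}), **self.cookies}
        # Returning None tells Scrapy to keep processing the request.
        return None
```

Because the middleware only touches request.cookies, it can be exercised with a stand-in request object before wiring it into a real crawl.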
Jason
02:45 AM
I wonder if it's somehow silently erroring out?
Jason
02:46 AM
I'm also running master and I'm running the python scraper directly on macOS without docker
Sandeep
02:46 AM
ok, let me try without docker and see if that helps any
Sandeep
07:24 PM
Running without docker didn't work, but running the non-base regular docker image seems to work!
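
For reference, a sketch of launching the regular image from a script, assuming it is the typesense/docsearch-scraper image, that it reads the crawl configuration from a CONFIG environment variable, and that the Typesense connection details come from an env file. The exact invocation Sandeep used is not shown in the thread, and the helper name scraper_command is hypothetical.

```python
import json


def scraper_command(config: dict, env_file: str) -> list:
    """Build a docker run command for the regular (non-base) scraper image."""
    return [
        "docker", "run", "-it",
        f"--env-file={env_file}",          # assumed to hold the Typesense host and API key
        "-e", f"CONFIG={json.dumps(config)}",  # crawl config passed as a JSON string
        "typesense/docsearch-scraper",
    ]


# Example (not executed here):
#   import subprocess
#   with open("config.json") as f:
#       subprocess.run(scraper_command(json.load(f), ".env"), check=True)
```

Passing the command as a list avoids shell quoting problems with the JSON config.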
Sandeep
07:24 PM
ok so it's def something on my end, thanks Jason
Jason
07:24 PM
Happy to help!