Troubleshooting Issues with DocSearch Hits and Scraper Configuration
TL;DR: Rubai ran into issues with search-result priorities and snippet ellipses in DocSearch hits. Jason helped debug, suggesting different versions of typesense-docsearch.js, updated initialization parameters, and running the scraper in a Linux-based environment. The issues with the hits structure and scraper configuration were resolved.
Mar 20, 2023 (7 months ago)
Rubai 07:03 PM
Jason 07:03 PM
Jason 07:04 PM
Rubai 07:05 PM
Rubai 07:13 PM
And can we add `(...)` at the start of a hit when the match is in a long text? That way it's easier to see that there is some text before the match.
Jason 07:40 PM
The scraper just indexes all the content that is on the page, as specified by the CSS selectors. If you don't want certain content to show up, you should exclude it via the CSS selectors.
Jason 07:42 PM
The `…` should technically be shown at the end of the hits… it looks like that's hidden in the UI. In your docsearch initialization code, could you try adding this:
typesenseSearchParameters: {
  filter_by: '...',
  highlight_affix_num_tokens: 3,
},
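For context, here is a sketch of where those parameters sit in a full typesense-docsearch.js init call. This is an illustration under assumptions, not the thread's actual code: the collection name is the one that appears later in this thread, and the host and API key are placeholders.

```javascript
// Sketch of a typesense-docsearch.js initialization config.
// Host, API key and filter_by value are placeholders, not real values.
const docsearchConfig = {
  typesenseCollectionName: 'Developer_Docs', // collection name seen later in this thread
  typesenseServerConfig: {
    nodes: [{ host: 'your-typesense-host', port: 443, protocol: 'https' }],
    apiKey: 'search-only-api-key',
  },
  typesenseSearchParameters: {
    filter_by: '...', // left elided, as in the message above
    // Keep up to 3 tokens of context before/after the matched snippet,
    // so long matches don't start abruptly at the matched word.
    highlight_affix_num_tokens: 3,
  },
};
// In the browser, this would be passed to the docsearch() global:
// docsearch(docsearchConfig);
```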
Rubai 07:46 PM
Jason 07:47 PM
Rubai 07:50 PM
Jason 07:51 PM
Jason 07:52 PM
Rubai 07:52 PM
Rubai 07:54 PM
https://gist.github.com/rubai99/5964d067547bb21ead934154b088b18c
Jason 08:42 PM
`snippet_threshold: 5`?
Rubai 08:47 PM
Jason 08:47 PM
typesenseSearchParameters: {
  filter_by: '...',
  snippet_threshold: 5,
},
Jason 08:48 PM
Rubai 08:49 PM
`...` on hits
Jason 08:49 PM
Jason 08:50 PM
Rubai 08:50 PM
Rubai 08:51 PM
I added `snippet_threshold: 5,` but I'm still getting the same result as before.
Rubai 08:52 PM
curl '' \
-H 'Accept: application/json, text/plain, */*' \
-H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Content-Type: text/plain' \
-H 'Origin: ' \
-H 'Pragma: no-cache' \
-H 'Referer: ' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-site' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
--data-raw '{"searches":[{"collection":"Developer_Docs","q":"to be present","query_by":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","include_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,anchor,url,type,id","highlight_full_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","group_by":"url","group_limit":3,"sort_by":"item_priority:desc","filter_by":"product_tag:=payment-page_android","snippet_threshold":5}]}' \
--compressed
Jason 08:53 PM
Rubai 08:54 PM
Jason 08:54 PM
Rubai 08:55 PM
Rubai 09:05 PM
Jason 09:09 PM
"snippet_threshold": 5,
"highlight_affix_num_tokens": 3
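Roughly speaking, `snippet_threshold` makes Typesense return a snippet instead of the whole field when the field is longer than the threshold (in words), and `highlight_affix_num_tokens` controls how many tokens of context are kept on each side of the match. Here is a toy simulation of that interaction; this is not Typesense's actual implementation, just the idea:

```javascript
// Toy illustration of snippeting behavior; NOT Typesense's real code.
function snippet(text, matchWord, threshold = 5, affixTokens = 3) {
  const tokens = text.split(/\s+/);
  // Fields at or below the threshold (in words) are returned whole.
  if (tokens.length <= threshold) return text;
  const i = tokens.findIndex((t) => t.toLowerCase().includes(matchWord.toLowerCase()));
  if (i === -1) return text;
  // Keep affixTokens tokens of context on each side of the match.
  const start = Math.max(0, i - affixTokens);
  const end = Math.min(tokens.length, i + affixTokens + 1);
  const prefix = start > 0 ? '... ' : '';
  const suffix = end < tokens.length ? ' ...' : '';
  return prefix + tokens.slice(start, end).join(' ') + suffix;
}
```

With these settings, a match deep inside a long paragraph gets a leading `...` because there is text before the kept context window, which is the effect Rubai was asking for.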
Rubai 09:12 PM
Jason 09:12 PM
Jason 09:12 PM
Jason 09:12 PM
Rubai 09:14 PM
Jason 09:14 PM
Rubai 09:15 PM
Jason 09:16 PM
Jason 09:16 PM
Rubai 09:17 PM
`...` for the text. Actually, the screenshots were taken at different times but show the same result; that's why they look the same.
Rubai 09:19 PM
"to be present", the 2nd hit: it's a long text, which is why I want to add `...` at the start.
Rubai 09:21 PM
Jason 09:36 PM
Jason 09:36 PM
Jason 09:39 PM
Rubai 09:48 PM
I mean that text hits like this should start or end with `...`. I saw this on the https://docusaurus.io/ site.
Jason 09:55 PM
`3.4.0-1`
and check now?
Rubai 10:04 PM

Mar 21, 2023 (7 months ago)
Jason 03:17 AM
`3.4.0-8`
Rubai 08:54 AM
Rubai 09:12 AM
Rubai 09:13 AM
Jason 01:39 PM
Rubai 02:10 PM
curl '' \
-H 'Accept: application/json, text/plain, */*' \
-H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Content-Type: text/plain' \
-H 'Origin: ' \
-H 'Pragma: no-cache' \
-H 'Referer: ' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-site' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
--data-raw '{"searches":[{"collection":"Developer_Docs","q":"session api","query_by":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","include_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,anchor,url,type,id","highlight_full_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","group_by":"url","group_limit":3,"sort_by":"item_priority:desc","snippet_threshold":5,"highlight_affix_num_tokens":3,"filter_by":"product_tag:=payment-page_android"}]}' \
--compressed
https://gist.github.com/rubai99/8f21fb34638fd68f0683137d4e6ee810
Sometimes it works, sometimes it breaks. Right now it's breaking.
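The curl above (with its URL redacted) posts a multi-search style payload. For readability, here is a sketch of the same request from JavaScript; the endpoint and API key are placeholders because they are redacted in the original command:

```javascript
// Reconstruction of the search payload from the curl command above.
// The endpoint and API key are placeholders: they are redacted in the original.
const payload = {
  searches: [
    {
      collection: 'Developer_Docs',
      q: 'session api',
      query_by: 'hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content',
      group_by: 'url',
      group_limit: 3,
      sort_by: 'item_priority:desc',
      snippet_threshold: 5,          // the two parameters Jason suggested
      highlight_affix_num_tokens: 3,
      filter_by: 'product_tag:=payment-page_android',
    },
  ],
};

// In a browser or Node 18+, the request itself would look like:
// fetch('https://YOUR-TYPESENSE-HOST/multi_search?x-typesense-api-key=YOUR_SEARCH_KEY', {
//   method: 'POST',
//   headers: { 'Content-Type': 'text/plain' },
//   body: JSON.stringify(payload),
// }).then((r) => r.json()).then(console.log);
```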
Jason 04:07 PM
<script src=""></script>
And then try replicating the same error from that screenshot and post a stack trace?
(This hopefully pulls in the source map and shows a proper stack trace.)
Rubai 04:28 PM
At first, when I added the version, it worked fine. After 5-6 minutes I got an error. Then after 1 hour I added this version again and got the same thing: it worked for a while, then suddenly started throwing this error. And now I'm getting the error as well.
Jason 04:30 PM
Rubai 04:30 PM
https://7a03-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/session
Jason 04:31 PM
Rubai 04:32 PM
Rubai 04:34 PM
Jason 04:34 PM
Jason 04:34 PM
Rubai 04:34 PM
Jason 04:37 PM
`3.4.0-9`?
Rubai 04:37 PM
Rubai 04:38 PM
Jason 04:39 PM
Jason 04:40 PM
Happy to hear that it works now!
Rubai 04:44 PM
Rubai 04:47 PM
Now it's working fine. Great work 👏
Jason 04:48 PM
Mar 22, 2023 (6 months ago)
Rubai 12:37 PM
We have `payment-page` & `upi-inapp` in our documentation. Suppose we first run the scraper for the collection `Developer_Docs_upi-inapp`, and then run the scraper again for the other collection, `Developer_Docs_payment-page`. Can we access both collections from a single documentation site? The benefit would be that when anything changes for a product, we only need to re-scrape that product's collection, instead of running the scraper for every product's collection.
For reference, you can check our documentation: https://docs.juspay.in/
Jason 03:40 PM
So you would have to fork the scraper and update it appropriately if you want to do partial scraping into the same collection.
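For what it's worth, if the products do end up in separate collections (`Developer_Docs_upi-inapp` and `Developer_Docs_payment-page` from the message above), a single multi-search request can query both in one round trip. This is a hypothetical sketch of the request body only; the stock docsearch.js UI is configured with a single collection name, which is why forking comes up at all:

```javascript
// Sketch: build one multi-search body that queries several per-product
// collections at once. Collection names are from the thread; the helper
// function and its query_by fields are illustrative, not a real API.
function buildMultiSearch(query, collections) {
  return {
    searches: collections.map((collection) => ({
      collection,
      q: query,
      query_by: 'hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,content',
      group_by: 'url',
      group_limit: 3,
    })),
  };
}

const body = buildMultiSearch('session api', [
  'Developer_Docs_upi-inapp',
  'Developer_Docs_payment-page',
]);
// POST `body` to https://<typesense-host>/multi_search, as in the curl commands above.
```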
Mar 23, 2023 (6 months ago)
Rubai 07:31 AM
And what changes are needed in `.env` for production?
TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=host.docker.internal
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
Rubai 07:05 PM
Jason 07:29 PM
Jason 07:30 PM
Jason 07:30 PM
TYPESENSE_API_KEY=<GENERATED_FROM_DASHBOARD>
TYPESENSE_HOST=
TYPESENSE_PORT=443
TYPESENSE_PROTOCOL=https
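To connect the dots, these env vars map directly onto the node config a Typesense client expects. A sketch of that mapping; the host value here is a placeholder, since it is blank in the message above, and the `env` object stands in for whatever loads your `.env` file:

```javascript
// Sketch: mapping scraper-style env vars onto a Typesense server config.
// Values mirror Jason's production example; the host is a placeholder
// because it is blank in the message above.
const env = {
  TYPESENSE_API_KEY: '<GENERATED_FROM_DASHBOARD>',
  TYPESENSE_HOST: 'your-cluster.typesense.net', // placeholder
  TYPESENSE_PORT: '443',
  TYPESENSE_PROTOCOL: 'https',
};

const serverConfig = {
  nodes: [
    {
      host: env.TYPESENSE_HOST,
      port: Number(env.TYPESENSE_PORT), // env values are strings; the client wants a number
      protocol: env.TYPESENSE_PROTOCOL,
    },
  ],
  apiKey: env.TYPESENSE_API_KEY,
};
```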
Jason 07:31 PM
Rubai 07:36 PM
Jason 07:37 PM
Rubai 07:38 PM
Jason 07:49 PM
Jason 07:49 PM

Mar 25, 2023 (6 months ago)
Rubai 10:21 PM
This is the error I'm getting while building `dockerfile:base` from typesense-docsearch-scraper.
Can I change the version to 111.0.5563.110-1? After changing the version, could it affect anything?
Rubai 10:37 PM

Mar 26, 2023 (6 months ago)
Jason 01:52 AM
Jason 01:53 AM
Rubai 11:11 AM
Rubai 06:51 PM
Jason 07:59 PM
Rubai 11:32 PM

Mar 27, 2023 (6 months ago)

Jason 12:45 AM
Rubai 09:45 AM
We changed `host='host.docker.internal'` to `host='localhost'`, because we don't use Docker to run the scraper as of now. We run it from VS Code via an API and are getting this error.
Jason 03:54 PM
Could you share the `.env` file you're using?

Mar 28, 2023 (6 months ago)
Rubai 07:44 AM