Troubleshooting Local Scraper & Sitemap Issues
TLDR Rubai experienced issues with a local scraper and sitemap URLs not working. Jason instructed them to use meta tags and adjust their config file, which resolved the issues.
Mar 13, 2023 (7 months ago)
Rubai
07:04 AM
When I use <https://example.com> instead of <http://host.docker.internal/>, it works perfectly. Can anyone please help me out?
Here is the config file:
{
"index_name": "payment-page",
"js_render": true,
"js_wait": 10,
"use_anchors": false,
"user_agent": "Custom Bot",
"start_urls": [
""
],
"sitemap_alternate_links": false,
"selectors": {
"lvl0":"h1, h2 , .heading-text" ,
"lvl1": "h3, .label" ,
"lvl2": ".key-header, .step-card-header-text, .th-row",
"text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
},
"strip_chars": " .,;:#",
"scrap_start_urls": true,
"custom_settings": {
"synonyms": {
"relevancy": ["relevant", "relevance"],
"relevant": ["relevancy", "relevance"],
"relevance": ["relevancy", "relevant"]
}
}
}
Output:
> DocSearch: http://host.docker.internal/payment-page/android/base-sdk-integration/session (0 records)
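Since the URL uses host.docker.internal, the scraper is presumably running inside Docker, and "0 records" can simply mean the container never reached the docs server. A quick sanity check (a sketch, assuming Docker 20.10+ and reusing the path from the output above) is to confirm the host is reachable from a container at all; on Linux the --add-host flag is needed, while Docker Desktop on macOS/Windows provides host.docker.internal automatically:
# Hypothetical reachability check from inside a container
docker run --rm --add-host=host.docker.internal:host-gateway curlimages/curl \
  -sI http://host.docker.internal/payment-page/android/base-sdk-integration/session
If this returns a 200 response, networking is fine and the empty index more likely comes from the selectors or js_render settings.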
Rubai
10:14 AM
Kishore Nallan
10:20 AM
Jason
06:23 PM
Mar 14, 2023 (7 months ago)
Rubai
06:05 AM
Example config:
[
{
"index_name": "payment-page",
"js_render": true,
"js_wait": 5,
"use_anchors": false,
"user_agent": "Custom Bot",
"start_urls": [
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
"",
""
],
"sitemap_alternate_links": false,
"selectors": {
"lvl0":"h1, h2 , .heading-text" ,
"lvl1": "h3, .label" ,
"lvl2": ".key-header, .step-card-header-text, .th-row",
"text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
},
"strip_chars": " .,;:#",
"scrap_start_urls": true,
"custom_settings": {
"synonyms": {
"relevancy": ["relevant", "relevance"],
"relevant": ["relevancy", "relevance"],
"relevance": ["relevancy", "relevant"]
}
}
},
{
"index_name": "payment-page2",
"js_render": true,
"js_wait": 5,
"use_anchors": false,
"user_agent": "Custom Bot",
"start_urls": [
"",
"",
""
],
"sitemap_alternate_links": false,
"selectors": {
"lvl0":"h1, h2 , .heading-text" ,
"lvl1": "h3, .label" ,
"lvl2": ".key-header, .step-card-header-text, .th-row",
"text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
},
"strip_chars": " .,;:#",
"scrap_start_urls": true,
"custom_settings": {
"synonyms": {
"relevancy": ["relevant", "relevance"],
"relevant": ["relevancy", "relevance"],
"relevance": ["relevancy", "relevant"]
}
}
}
]
Jason
02:03 PM
Rubai
02:26 PM
Jason
03:33 PM It can't be an array of configs like that.
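In other words, the scraper expects a single top-level JSON object per config file, not a JSON array of config objects. A minimal sketch of the expected shape, with placeholder values:
{
  "index_name": "payment-page",
  "start_urls": [
    "https://docs.example.com/payment-page/"
  ],
  "selectors": {
    "lvl0": "h1, h2",
    "lvl1": "h3",
    "text": "p"
  }
}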
Jason
03:34 PM
Rubai
06:31 PM
Jason
06:33 PM
Rubai
06:49 PM Actually, our docs cover more than one product, and we want the search within a product to show results for that product only.
Rubai
06:50 PM payment-page and payment-page2 are two different products.
Jason
11:39 PM
Mar 15, 2023 (7 months ago)
Rubai
05:57 AM payment page and in-app-upi. We want to filter search results so that one product's hits do not show up in the other product's results. How would we be able to do that with a single config file?
Jason
05:35 PM So in your payment page docs pages, you'd add a meta tag to all pages called, say:
<meta name="docsearch:product_tag" content="payment_page" />
In your in-app-upi docs pages, you'd add a meta tag to all pages called, say:
<meta name="docsearch:product_tag" content="in_app_upi" />
Then on the front end, depending on which product's docs the user is visiting right now, you would pass a filter_by parameter in the docsearch.js config, like this: https://typesense.org/docs/guide/docsearch.html#option-c-custom-docs-framework-with-docsearch-js-v3-modal-layout
See typesenseSearchParameters.filter_by. In your case you'd set it like this:
docsearch({
  container: '#searchbar',
  typesenseCollectionName: 'docs',
  typesenseServerConfig: { ... },
  typesenseSearchParameters: {
    filter_by: 'product:=in_app_upi'
  },
});
Mar 16, 2023 (7 months ago)
Rubai
06:25 AM
Rubai
10:39 AM
docsearch({
  container: '#searchbar',
  typesenseCollectionName: "Developer_Docs",
  typesenseServerConfig: {
    nodes: [{
      host: 'localhost',
      port: '8108',
      protocol: 'http'
    }],
    apiKey: 'xyz',
  },
  typesenseSearchParameters: {
    filter_by: `product:=${documentationJSON.documentation.productId}`
  },
});
Here is my docsearch.js config. It doesn't work for me; it's fetching 0 results in the search bar.
Can you please help me figure out what I am doing wrong here?
Here is my meta tag:
<meta name="docsearch:product_tag" content="{$page.params.products}" />
Jason
04:22 PM
Jason
04:22 PM Could you run GET /collections against your Typesense node and post the output of that?
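For reference, a sketch of that request against the local node from the docsearch.js snippet above (host localhost, port 8108, and API key 'xyz' are taken from that snippet):
curl -H "X-TYPESENSE-API-KEY: xyz" \
  "http://localhost:8108/collections"
The response is a JSON array describing each collection, which is enough to confirm whether the scraper actually created Developer_Docs and which fields it contains.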
Rubai
08:22 PM
-H 'Accept: application/json, text/plain, */*' \
-H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Content-Type: text/plain' \
-H 'Origin: http://localhost:3000' \
-H 'Pragma: no-cache' \
-H 'Referer: http://localhost:3000/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-site' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
--data-raw '{"searches":[{"collection":"Developer_Docs","q":"session ","query_by":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","include_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,anchor,url,type,id","highlight_full_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","group_by":"url","group_limit":3,"filter_by":"product:=payment-page_android"}]}' \
--compressed
Rubai
08:31 PM
Jason
08:35 PM Can you change
filter_by: `product:=${documentationJSON.documentation.productId}`
to
filter_by: `product_tag:=${documentationJSON.documentation.productId}`
and try again?
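For reference, a sketch of Rubai's docsearch.js config with only that change applied (assuming, as Jason's fix implies, that the scraper stores the docsearch:product_tag meta tag under a field named product_tag):
docsearch({
  container: '#searchbar',
  typesenseCollectionName: "Developer_Docs",
  typesenseServerConfig: {
    nodes: [{
      host: 'localhost',
      port: '8108',
      protocol: 'http'
    }],
    apiKey: 'xyz',
  },
  typesenseSearchParameters: {
    // Filter on the field created from the docsearch:product_tag meta tag
    filter_by: `product_tag:=${documentationJSON.documentation.productId}`
  },
});
For the filter to match, documentationJSON.documentation.productId has to equal the content value rendered into the meta tag on those pages.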
Jason
08:36 PM
Rubai
08:41 PM
Mar 17, 2023 (7 months ago)
Rubai
06:49 AM
Config:
{
"index_name": "Developer_Docs",
"js_render": true,
"js_wait": 5,
"use_anchors": false,
"user_agent": "Custom Bot",
"start_urls": [
"",
"",
""
],
"sitemap_urls": [
""
],
"sitemap_alternate_links": true,
"selectors": {
"lvl0":"h1,h2,[data-search-class='lvl0']",
"lvl1":"h3,[data-search-class='lvl1']",
"lvl2":"[data-search-class='lvl2']",
"text":"p:not(:empty),[data-search-class='text']"
},
"strip_chars": " .,;:#",
"scrap_start_urls": true,
"custom_settings": {
"synonyms": {
"relevancy": ["relevant", "relevance"],
"relevant": ["relevancy", "relevance"],
"relevance": ["relevancy", "relevant"]
}
}
}
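For context, the data-search-class selectors above would match markup along these lines (a hypothetical page fragment, not taken from the actual docs):
<h1>Page title</h1>                                    <!-- lvl0: h1, h2, [data-search-class='lvl0'] -->
<span data-search-class="lvl1">Section label</span>    <!-- lvl1 -->
<span data-search-class="lvl2">Step heading</span>     <!-- lvl2 -->
<p>Body text that gets indexed.</p>                    <!-- text: p:not(:empty), [data-search-class='text'] -->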
Jason
04:36 PM
Rubai
06:45 PM
Rubai
06:46 PM sitemap_urls is not working for me.
Mar 18, 2023 (7 months ago)
Rubai
08:52 PM
Config:
{
"index_name": "Developer_Docs",
"js_render": true,
"js_wait": 5,
"use_anchors": false,
"user_agent": "Custom Bot",
"start_urls": [
""
],
"sitemap_urls": [
""
],
"sitemap_alternate_links": true,
"selectors": {
"lvl0":"h1,h2,[data-search-class='lvl0']",
"lvl1":"h3,[data-search-class='lvl1']",
"lvl2":"[data-search-class='lvl2']",
"text":"p,[data-search-class='text']"
},
"strip_chars": " .,;:#",
"scrap_start_urls": true,
"custom_settings": {
"synonyms": {
"relevancy": ["relevant", "relevance"],
"relevant": ["relevancy", "relevance"],
"relevance": ["relevancy", "relevant"]
}
}
}
Jason
09:01 PM The scraper only crawls URLs that fall under the ones listed in start_urls, regardless of whether they are mentioned in the sitemap or not. So, for example, with the above config it will only crawl pages like this:
<https://docs.juspay.in/payment-page/android/overview/integration-architecture/*>
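In other words, start_urls acts as the crawl allow-list, so broadening the index means broadening start_urls itself, which is what Jason suggests next. A sketch of the relevant fragment (the sitemap URL is a placeholder, since the real one is redacted above):
{
  "start_urls": [
    "https://docs.juspay.in/payment-page"
  ],
  "sitemap_urls": [
    "https://docs.juspay.in/sitemap.xml"
  ]
}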
Jason
09:02 PM Could you add <https://docs.juspay.in/payment-page> to the start_urls, since that seems to be your base URL?
Rubai
09:07 PM
Jason
09:08 PM
Rubai
09:10 PM
Rubai
09:10 PM I got this:
DEBUG:scrapy.dupefilters:Filtered duplicate request: <GET > - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)