Duplicate Search Results Issue in Web Crawler Configuration
TLDR Lane was seeing duplicate search results from sites indexed by the web crawler. Jason suggested that a link with stray whitespace in the sitemap might be the cause, but the issue remains unresolved.
Nov 29, 2022 (10 months ago)
Lane
09:57 PM
We have a weird problem with duplicate search results.
For context, we're crawling roughly a dozen static sites using the web crawler. Each site gets its own index.
Most of the sites were generated with MkDocs, but two are based on Docusaurus. The two Docusaurus sites sometimes show duplicate results; the MkDocs sites work fine.
I’ve been able to verify that the issue happens in both our production and non-prod environments, so it’s consistent. That leads me to think it’s the crawler config file we’re using. The config file is pretty much a direct copy of the example linked in the docs (I’ll paste it below).
What I haven’t been able to figure out is why some records return dupes and some don’t. Judging from the returned IDs, the index does indeed appear to contain duplicate entries. For example:
{
  "document": {
    "anchor": "before-you-begin",
    "content": "SAML to AWS STS Keys chrome plugin",
    "content_camel": "SAML to AWS STS Keys chrome plugin",
    "docusaurus_tag": "docs-default-current",
    "hierarchy": {
      "lvl0": "Kubernetes Overview",
      "lvl1": null,
      "lvl2": "Before you begin ",
      "lvl3": null,
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "hierarchy.lvl0": "Kubernetes Overview",
    "hierarchy.lvl2": "Before you begin ",
    "hierarchy_camel": [
      {
        "lvl0": "Kubernetes Overview",
        "lvl1": null,
        "lvl2": "Before you begin ",
        "lvl3": null,
        "lvl4": null,
        "lvl5": null,
        "lvl6": null
      }
    ],
    "hierarchy_radio": {
      "lvl0": null,
      "lvl1": null,
      "lvl2": null,
      "lvl3": null,
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "hierarchy_radio_camel": {
      "lvl0": null,
      "lvl1": null,
      "lvl2": null,
      "lvl3": null,
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "id": "12620",
    "item_priority": 9,
    "language": "en",
    "no_variables": true,
    "objectID": "2f9d69bf04250a823c88ee2f451c33d6d34ff97a",
    "tags": [],
    "type": "content",
    "url": "snip",
    "url_without_anchor": "snip",
    "url_without_variables": "snip",
    "version": ["current"],
    "weight": {
      "level": 0,
      "page_rank": 0,
      "position": 9
    }
  },
  "highlights": [
    {
      "field": "content",
      "matched_tokens": ["SAML", "to", "AWS", "STS"],
      "snippet": "<mark>SAML</mark> <mark>to</mark> <mark>AWS</mark> <mark>STS</mark> Keys chrome plugin"
    }
  ],
  "text_match": 289361770949115905
},
{
  "document": {
    "anchor": "before-you-begin",
    "content": "SAML to AWS STS Keys chrome plugin",
    "content_camel": "SAML to AWS STS Keys chrome plugin",
    "docusaurus_tag": "docs-default-current",
    "hierarchy": {
      "lvl0": "Kubernetes Overview",
      "lvl1": null,
      "lvl2": "Before you begin ",
      "lvl3": null,
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "hierarchy.lvl0": "Kubernetes Overview",
    "hierarchy.lvl2": "Before you begin ",
    "hierarchy_camel": [
      {
        "lvl0": "Kubernetes Overview",
        "lvl1": null,
        "lvl2": "Before you begin ",
        "lvl3": null,
        "lvl4": null,
        "lvl5": null,
        "lvl6": null
      }
    ],
    "hierarchy_radio": {
      "lvl0": null,
      "lvl1": null,
      "lvl2": null,
      "lvl3": null,
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "hierarchy_radio_camel": {
      "lvl0": null,
      "lvl1": null,
      "lvl2": null,
      "lvl3": null,
      "lvl4": null,
      "lvl5": null,
      "lvl6": null
    },
    "id": "15885",
    "item_priority": 8,
    "language": "en",
    "no_variables": true,
    "objectID": "3777cda34ef57f994999ec9e5e74b8b4b995bd5a",
    "tags": [],
    "type": "content",
    "url": "snip",
    "url_without_anchor": " snip ",
    "url_without_variables": " snip ",
    "version": ["current"],
    "weight": {
      "level": 0,
      "page_rank": 0,
      "position": 8
    }
  },
  "highlights": [
    {
      "field": "content",
      "matched_tokens": ["SAML", "to", "AWS", "STS"],
      "snippet": "<mark>SAML</mark> <mark>to</mark> <mark>AWS</mark> <mark>STS</mark> Keys chrome plugin"
    }
  ],
  "text_match": 289361770949115905
},
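One way to confirm that the duplication lives in the collection itself (rather than in query handling) is to export the collection and group records by the fields that should be unique. Here is a minimal sketch using the Typesense Python client; the node address and API key are placeholders, not values from this thread:
import json
from collections import defaultdict

import typesense

# Placeholder connection details; point these at your own Typesense node.
client = typesense.Client({
    'nodes': [{'host': 'localhost', 'port': 8108, 'protocol': 'http'}],
    'api_key': 'REDACTED',
    'connection_timeout_seconds': 5,
})

# Export the collection as JSONL and bucket records by the fields that
# should uniquely identify one section of one page. Stripping the URL
# here means whitespace-padded variants land in the same bucket.
groups = defaultdict(list)
for line in client.collections['toolbox'].documents.export().splitlines():
    doc = json.loads(line)
    key = (doc['url_without_anchor'].strip(), doc.get('anchor'), doc.get('content'))
    groups[key].append(doc['objectID'])

# Any bucket with more than one objectID is a duplicate candidate.
for key, ids in groups.items():
    if len(ids) > 1:
        print(key, ids)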
We are using the multi-search endpoint. Our query looks something like this:
{
  "searches": [
    {
      "q": "saml to aws sts",
      "query_by": "content",
      "collection": "toolbox",
      "per_page": 10,
      "exhaustive_search": true,
      "page": 1
    },
    ///snip
    {
      "q": "saml to aws sts",
      "query_by": "content",
      "collection": "readme",
      "per_page": 10,
      "exhaustive_search": true,
      "page": 1
    }
  ]
}
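As an aside, the same batch can be expressed through the Python client's multi_search API. This sketch assumes the client from the export sketch above and shows only the two collections quoted in the thread:
# Shared parameters are passed once; each entry in 'searches' can override them.
results = client.multi_search.perform(
    {
        'searches': [
            {'collection': 'toolbox', 'q': 'saml to aws sts'},
            {'collection': 'readme', 'q': 'saml to aws sts'},
        ]
    },
    {'query_by': 'content', 'per_page': 10, 'page': 1, 'exhaustive_search': True},
)

# Print objectID and URL per hit to spot repeated records quickly.
for result in results['results']:
    for hit in result.get('hits', []):
        print(hit['document']['objectID'], hit['document']['url_without_anchor'])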
Any idea where to even start diagnosing this? Here’s the crawler config:
{
  "index_name": "toolbox",
  "start_urls": [" "],
  "sitemap_urls": [""],
  "js_render": true,
  "js_wait": 15,
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": ["language", "version", "type", "docusaurus_tag"],
    "attributesToRetrieve": ["hierarchy", "content", "anchor", "url", "url_without_anchor", "type"]
  }
}
Jason
10:18 PM
Notice how
"url_without_anchor": " snip ",
has spaces around snip. I suspect there is a link in the sitemap, or somewhere in the docs, that has a space like that, which is why they look like different URLs to the scraper and end up creating separate docs.
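If that hunch is right, the sitemap itself should show it. Here is a quick sketch that fetches the sitemap (the URL is a placeholder) and flags any <loc> entry whose text changes under strip():
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = 'https://example.com/sitemap.xml'  # placeholder

# Parse the raw bytes so an XML encoding declaration doesn't trip fromstring().
root = ET.fromstring(requests.get(SITEMAP_URL).content)
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# A <loc> whose text changes under strip() would look like a distinct URL
# to the crawler, producing a second copy of every record on that page.
for loc in root.findall('.//sm:loc', ns):
    raw = loc.text or ''
    if raw != raw.strip():
        print(repr(raw))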
Similar Threads
Troubleshooting Local Scraper & Sitemap Issues
Rubai experienced issues with a local scraper and sitemap URLs not working. Jason instructed them to use meta tags and adjust their config file, which resolved the issues.
Troubleshooting Issues with DocSearch Hits and Scraper Configuration
Rubai encountered issues with search result priorities and ellipsis. Jason helped debug the issue and suggested using different versions of typesense-docsearch.js, updating initialization parameters, and running the scraper on a Linux-based environment. The issues related to hits structure and scraper configuration were resolved.
Solving Typesense Docsearch Scraper Issues
Sandeep was having issues with Typesense's docsearch scraper and getting fewer results than with Algolia's scraper. Jason helped by sharing the query they use and advised checking the running version of the scraper. The issue was resolved when Sandeep ran the regular (non-base) Docker image.