#community-help

Duplicate Search Results Issue in Web Crawler Configuration

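TLDR Lane is seeing duplicate search results from two Docusaurus sites indexed with the web crawler. Jason first suspected stray spaces in sitemap links, which turned out to be an artifact of Lane sanitizing the URLs, and later concluded that the same content is duplicated across two different URLs on the site.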

Nov 29, 2022 (10 months ago)
Lane
09:57 PM
Hey everyone!

We have a weird problem with duplicate search results.

For context, we're crawling roughly a dozen static sites using the web crawler. Each site gets its own index.

Most of the sites were generated using MkDocs, but two are based off Docusaurus. The two Docusaurus sites are occasionally showing duplicate results, but not always. The MkDocs sites work fine.

I’ve been able to verify that the issue happens in both our production and non-prod environments, so it’s consistent. This leads me to think that it’s the crawler config file we’re using. The config file is pretty much a direct copy from the example site linked in the docs (I’ll paste it below).

What I haven’t been able to figure out is why some records return dupes and some do not. The index does indeed appear to contain duplicate entries, judging from the returned IDs. For example:

{
   "document": {
       "anchor": "before-you-begin",
       "content": "SAML to AWS STS Keys chrome plugin",
       "content_camel": "SAML to AWS STS Keys chrome plugin",
       "docusaurus_tag": "docs-default-current",
       "hierarchy": {
           "lvl0": "Kubernetes Overview",
           "lvl1": null,
           "lvl2": "Before you begin ",
           "lvl3": null,
           "lvl4": null,
           "lvl5": null,
           "lvl6": null
       },
       "hierarchy.lvl0": "Kubernetes Overview",
       "hierarchy.lvl2": "Before you begin ",
       "hierarchy_camel": [
           {
               "lvl0": "Kubernetes Overview",
               "lvl1": null,
               "lvl2": "Before you begin ",
               "lvl3": null,
               "lvl4": null,
               "lvl5": null,
               "lvl6": null
           }
       ],
       "hierarchy_radio": {
           "lvl0": null,
           "lvl1": null,
           "lvl2": null,
           "lvl3": null,
           "lvl4": null,
           "lvl5": null,
           "lvl6": null
       },
       "hierarchy_radio_camel": {
           "lvl0": null,
           "lvl1": null,
           "lvl2": null,
           "lvl3": null,
           "lvl4": null,
           "lvl5": null,
           "lvl6": null
       },
       "id": "12620",
       "item_priority": 9,
       "language": "en",
       "no_variables": true,
       "objectID": "2f9d69bf04250a823c88ee2f451c33d6d34ff97a",
       "tags": [],
       "type": "content",
       "url": "snip",
       "url_without_anchor": "snip",
       "url_without_variables": "snip",
       "version": [
           "current"
       ],
       "weight": {
           "level": 0,
           "page_rank": 0,
           "position": 9
       }
   },
   "highlights": [
       {
           "field": "content",
           "matched_tokens": [
               "SAML",
               "to",
               "AWS",
               "STS"
           ],
           "snippet": "<mark>SAML</mark> <mark>to</mark> <mark>AWS</mark> <mark>STS</mark> Keys chrome plugin"
       }
   ],
   "text_match": 289361770949115905
},
{
   "document": {
       "anchor": "before-you-begin",
       "content": "SAML to AWS STS Keys chrome plugin",
       "content_camel": "SAML to AWS STS Keys chrome plugin",
       "docusaurus_tag": "docs-default-current",
       "hierarchy": {
           "lvl0": "Kubernetes Overview",
           "lvl1": null,
           "lvl2": "Before you begin ",
           "lvl3": null,
           "lvl4": null,
           "lvl5": null,
           "lvl6": null
       },
       "hierarchy.lvl0": "Kubernetes Overview",
       "hierarchy.lvl2": "Before you begin ",
       "hierarchy_camel": [
           {
               "lvl0": "Kubernetes Overview",
               "lvl1": null,
               "lvl2": "Before you begin ",
               "lvl3": null,
               "lvl4": null,
               "lvl5": null,
               "lvl6": null
           }
       ],
       "hierarchy_radio": {
           "lvl0": null,
           "lvl1": null,
           "lvl2": null,
           "lvl3": null,
           "lvl4": null,
           "lvl5": null,
           "lvl6": null
       },
       "hierarchy_radio_camel": {
           "lvl0": null,
           "lvl1": null,
           "lvl2": null,
           "lvl3": null,
           "lvl4": null,
           "lvl5": null,
           "lvl6": null
       },
       "id": "15885",
       "item_priority": 8,
       "language": "en",
       "no_variables": true,
       "objectID": "3777cda34ef57f994999ec9e5e74b8b4b995bd5a",
       "tags": [],
       "type": "content",
       "url": "snip",
       "url_without_anchor": " snip ",
       "url_without_variables": " snip ",
       "version": [
           "current"
       ],
       "weight": {
           "level": 0,
           "page_rank": 0,
           "position": 8
       }
   },
   "highlights": [
       {
           "field": "content",
           "matched_tokens": [
               "SAML",
               "to",
               "AWS",
               "STS"
           ],
           "snippet": "<mark>SAML</mark> <mark>to</mark> <mark>AWS</mark> <mark>STS</mark> Keys chrome plugin"
       }
   ],
   "text_match": 289361770949115905
},

We are using the multi-search endpoint. Our query looks something like this:

{
   "searches": [
       {
           "q": "saml to aws sts",
           "query_by": "content",
           "collection": "toolbox",
           "per_page": 10,
           "exhaustive_search": true,
           "page": 1
       },
 ///snip
       {
           "q": "saml to aws sts",
           "query_by": "content",
           "collection": "readme",
           "per_page": 10,
           "exhaustive_search": true,
           "page": 1
       }
   ]
}

Any idea where to even start diagnosing this? Here’s the crawler config:

{
  "index_name": "toolbox",
  "start_urls": [""],
  "sitemap_urls": [""],
  "js-render": true,
  "js-wait": 15,
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": ["language", "version", "type", "docusaurus_tag"],
    "attributesToRetrieve": ["hierarchy", "content", "anchor", "url", "url_without_anchor", "type"]
  }
}

Jason
10:18 PM
Here’s the diff between the two documents.

Notice how "url_without_anchor": " snip " has spaces around snip. I suspect there is a link in the sitemap or somewhere in the docs that has a space like that, which would make the two look like different URLs to the scraper and end up creating separate docs.
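A field-by-field comparison like the one Jason describes can be scripted. The following is a minimal sketch, not the tool Jason used: `flatten` and `diff_documents` are hypothetical helper names, and the only assumption about the data is that each argument is the `document` object from a search hit, as in the response pasted above.

```python
from typing import Any


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into dotted-path keys."""
    if isinstance(value, dict):
        items = list(value.items())
    elif isinstance(value, list):
        items = [(str(i), v) for i, v in enumerate(value)]
    else:
        return {prefix: value}
    flat: dict[str, Any] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def diff_documents(a: dict, b: dict) -> dict[str, tuple[Any, Any]]:
    """Return only the fields whose values differ between two documents."""
    flat_a, flat_b = flatten(a), flatten(b)
    return {
        key: (flat_a.get(key), flat_b.get(key))
        for key in sorted(flat_a.keys() | flat_b.keys())
        if flat_a.get(key) != flat_b.get(key)
    }
```

Run against the two records in this thread, a comparison like this would flag `id`, `item_priority`, `objectID` and `weight.position`, plus the two URL fields that differ only by the whitespace in the sanitized paste.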
Nov 30, 2022 (10 months ago)
Lane
04:11 PM
Those spaces do not exist in the real result. That's a side effect of me sanitizing the URLs. Sorry about that...
Dec 02, 2022 (10 months ago)
Lane
05:15 PM
Any other suggestions? I'm still struggling to find any sort of difference between these records.
Jason
05:36 PM
Could you share the full API response from Typesense?
Lane
05:50 PM
Jason
07:46 PM
It looks like the same content is duplicated across two different URLs / pages on the site
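One way to check this against a saved search response is to group hits by their text and anchor, then list the pages each group was indexed from. This is a rough sketch under assumptions: `duplicate_urls` is a hypothetical helper, `hits` is assumed to be the list of hit objects from one search result (each with a `document` key, as in the response earlier in the thread), and matching on `content` plus `anchor` is just one plausible definition of "same content".

```python
from collections import defaultdict


def duplicate_urls(hits: list[dict]) -> dict[tuple, list[str]]:
    """Map (content, anchor) to the distinct page URLs that carry it.

    Only groups that appear on more than one page are returned.
    """
    groups: dict[tuple, set[str]] = defaultdict(set)
    for hit in hits:
        doc = hit["document"]
        key = (doc.get("content"), doc.get("anchor"))
        # Treat a missing or null URL as an empty string so sorting is safe.
        groups[key].add(doc.get("url_without_anchor") or "")
    return {
        key: sorted(urls)
        for key, urls in groups.items()
        if len(urls) > 1
    }
```

If the result is non-empty, the listed URLs are the pages to compare on the site itself, for example to see whether both are reachable from the sitemap or the navigation.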