#community-help

Crawler Deleting Old Collection and Creating New Name

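TLDR James's production website broke because every crawler run appeared to replace his Typesense collection with a newly named one. Jason explained that the scraper writes to timestamped collections behind an alias named after "index_name", suggested setting "index_name" in the scraper config file to the desired name, and explained why the scraper says "index_name" while the frontend says "typesenseCollectionName".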

Feb 08, 2023 (10 months ago)
James
08:45 PM
It seems that every time I run the crawler, it deletes my old collection and creates a new one with a new arbitrary name, which breaks my production website (because the collection name changes).
I read the entire documentation, but it doesn't mention anything about managing collection names or specifying the intended collection name to the scraper:
https://typesense.org/docs/guide/docsearch.html#run-the-scraper
Can someone link me to the correct documentation for this? I feel like I must be missing something basic.
Jason
08:50 PM
I was just reading your other comment…

Every time the scraper runs it does the following:

1. Look at the index_name field in your docsearch-scraper config file (let’s say it’s defined as index_name: docs).
2. Create a new collection called docs_<current_unix_timestamp>
3. Create/update an alias called docs to point to docs_<current_unix_timestamp>.
4. Delete the previously scraped version of the docs, stored in docs_<previous_timestamp>.

Think of the docs alias as a symlink that points to the latest scraped version of the docs. The scraper handles this update automatically.

In the docsearch config on the frontend, you want to use docs as the index name, instead of the timestamped collection name.
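
The four steps above can be sketched as a toy in-memory model in Python. This is not the scraper's actual code: the `ToySearchServer` class and the `run_scraper` and `search_target` functions are hypothetical names invented for illustration, and the only behavior taken from the thread is the timestamped collection name, the alias update, and the deletion of the previous collection.

```python
import time


class ToySearchServer:
    """Hypothetical in-memory stand-in for a server with collections and aliases."""

    def __init__(self):
        self.collections = {}  # collection name -> list of documents
        self.aliases = {}      # alias name -> collection name


def run_scraper(server, index_name, documents, now=None):
    """Mimic one scraper run as described in the thread (illustrative only)."""
    # Steps 1-2: build a new collection name from index_name plus a Unix timestamp.
    timestamp = int(time.time()) if now is None else now
    new_name = f"{index_name}_{timestamp}"
    server.collections[new_name] = list(documents)

    # Step 3: create or update the alias so it points at the new collection.
    previous = server.aliases.get(index_name)
    server.aliases[index_name] = new_name

    # Step 4: delete the collection from the previous run, if there was one.
    if previous is not None and previous != new_name:
        server.collections.pop(previous, None)
    return new_name


def search_target(server, name):
    """Resolve an alias to its collection; fall back to the name itself."""
    return server.aliases.get(name, name)
```

In this model, a frontend that queries `docs` keeps working across runs because `search_target(server, "docs")` always resolves to the newest collection, whereas a frontend that hard-codes a timestamped name breaks as soon as that collection is deleted.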
Jason
08:53 PM
~So in your case, it looks like you’ve defined indexName as IsaacScript.~

Actually, that’s not the scraper configuration. That’s the FE configuration.

Could you share your docsearch-scraper configuration?
James
08:55 PM
My docsearch-scraper configuration is directly copy pasted from the docs:
James
08:55 PM
{
  "index_name": "docusaurus-2",
  "start_urls": [
    "https://isaacscript.github.io/"
  ],
  "sitemap_urls": [
    "https://docusaurus.io/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [
    "/tests"
  ],
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 46250
}
James
08:55 PM
This is what the docs directly tell you to use. I have made no other changes.
Jason
08:55 PM
Notice this line:

"index_name": "docusaurus-2"
Jason
08:55 PM
That’s where the collection name comes from
Jason
08:56 PM
You want to change that to "index_name": "IsaacScript" and rerun the scraper
James
08:56 PM
Oh, that's confusing. Why is the value "index_name" in some places, and "typesenseCollectionName" in other places? Shouldn't they both be called the same thing?
Jason
08:57 PM
Fair question! It’s because the scraper is forked from Algolia’s docsearch project, and I made it work with Typesense. Algolia calls a collection of documents an “index”, whereas Typesense calls it a “collection”.
Jason
08:57 PM
I didn’t want to change those naming conventions, especially for folks switching from Algolia to Typesense
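
Since the two settings use different names for the same thing, a small consistency check can catch a mismatch before it reaches production. The following is a minimal sketch in Python, assuming the scraper config is available as JSON text and the frontend's "typesenseCollectionName" value is passed in as a string; the `names_match` helper is a hypothetical name, not part of the scraper or of any Typesense library.

```python
import json


def names_match(scraper_config_text, frontend_collection_name):
    """Return True if the scraper's index_name equals the frontend's collection name.

    Hypothetical helper for illustration: it only compares the two strings
    and does not contact a Typesense server.
    """
    config = json.loads(scraper_config_text)
    return config["index_name"] == frontend_collection_name
```

With the config pasted earlier in the thread, `names_match(config_text, "IsaacScript")` would return `False` because "index_name" is "docusaurus-2"; after changing "index_name" to "IsaacScript" it would return `True`.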
James
09:12 PM
Jason
09:43 PM
Amazing, thank you! 🙏
Jason
09:46 PM
> (i.e. the first part of the public URL that end-users will connect to).
This part feels a little confusing as to whether that includes https:// or not…
James
09:46 PM
Sure, feel free to edit as needed.

Jason
10:04 PM
Pushed it out!

