#community-help

Crawler Deleting Old Collection and Creating New Name

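TLDR James's production website broke because the Typesense docsearch scraper created a new timestamped collection on every run. Jason explained that the scraper keeps an alias named after "index_name" pointing at the latest collection, suggested setting "index_name" in the scraper config to the name the frontend uses, and explained that the "index_name" versus "typesenseCollectionName" naming difference exists because the scraper is forked from Algolia's docsearch project.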


Solved
Feb 08, 2023 (8 months ago)
James
08:45 PM
It seems that every time I run the crawler, it deletes my old collection and creates a new one with a new arbitrary name, which breaks my production website (because the collection name changes).
I read the entire documentation, but it doesn't mention anything about managing collection names or specifying the intended collection name to the scraper:
https://typesense.org/docs/guide/docsearch.html#run-the-scraper
Can someone link me to the correct documentation for this? I feel like I must be missing something basic.
Jason
08:50 PM
I was just reading your other comment…

Every time the scraper runs it does the following:

1. Look at the index_name field in your docsearch-scraper config file (let’s say it’s defined as index_name: docs).
2. Create a new collection called docs_<current_unix_timestamp>
3. Create/update an alias called docs to point to docs_<current_unix_timestamp>.
4. Delete the previously scraped version of the docs, stored in docs_<previous_timestamp>.
Think of the docs alias as a symlink that points to the latest scraped version of the docs. The scraper updates the alias automatically.

In the docsearch config on the frontend, you want to use docs (the alias) as the index name, instead of the timestamped collection name.
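
The four steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the set and dict below are in-memory stand-ins for a Typesense server, and run_scraper is a hypothetical helper, not the scraper's actual code.

```python
import time


def run_scraper(index_name, collections, aliases, now=None):
    """Simulate one scraper run (hypothetical helper, in-memory only).

    `collections` is a set of collection names and `aliases` maps an
    alias name to a collection name; both stand in for server state.
    """
    timestamp = int(time.time()) if now is None else now

    # Step 2: create a new collection named <index_name>_<unix_timestamp>.
    new_collection = f"{index_name}_{timestamp}"
    collections.add(new_collection)

    # Step 3: create or update the alias so it points at the new collection.
    previous = aliases.get(index_name)
    aliases[index_name] = new_collection

    # Step 4: delete the collection from the previous run, if there was one.
    if previous is not None and previous != new_collection:
        collections.discard(previous)

    return new_collection


collections, aliases = set(), {}
run_scraper("docs", collections, aliases, now=1675888000)
run_scraper("docs", collections, aliases, now=1675889000)

# The timestamped name changed between runs, but the alias name did not.
print(aliases)
print(collections)
```

The frontend only ever needs the alias name ("docs" here), which stays the same across runs while the timestamped collection behind it is replaced.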
Jason
08:53 PM
~So in your case, it looks like you’ve defined indexName as IsaacScript. ~

Actually, that’s not the scraper configuration. That’s the FE configuration.

Could you share your docsearch-scraper configuration?
James
08:55 PM
My docsearch-scraper configuration is directly copy pasted from the docs:
James
08:55 PM
{
  "index_name": "docusaurus-2",
  "start_urls": [
    "https://isaacscript.github.io/"
  ],
  "sitemap_urls": [
    "https://docusaurus.io/sitemap.xml"
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [
    "/tests"
  ],
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 46250
}
James
08:55 PM
This is what the docs directly tell you to use. I have made no other changes.
Jason
08:55 PM
Notice this line:

"index_name": "docusaurus-2"
Jason
08:55 PM
That’s where the collection name comes from
Jason
08:56 PM
You want to change that to "index_name": "IsaacScript" and rerun the scraper
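
A minimal sketch of a check for this mismatch, assuming the scraper config is available as JSON text and the frontend's collection name is known as a plain string. The helper collection_names_match is hypothetical and not part of the scraper or of Typesense.

```python
import json


def collection_names_match(scraper_config_json, frontend_collection_name):
    """Return True when the scraper's index_name equals the frontend's name.

    Hypothetical helper: the scraper creates an alias named after
    index_name, so the frontend must query that same name.
    """
    config = json.loads(scraper_config_json)
    return config["index_name"] == frontend_collection_name


# Config copied from the docs, frontend already set to "IsaacScript".
before = '{"index_name": "docusaurus-2"}'
# Config after applying the suggested change.
after = '{"index_name": "IsaacScript"}'

print(collection_names_match(before, "IsaacScript"))
print(collection_names_match(after, "IsaacScript"))
```

Running a check like this before each scrape would flag the case where the frontend queries a name that the scraper never creates.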
James
08:56 PM
Oh, that's confusing. Why is the value called "index_name" in some places and "typesenseCollectionName" in others? Shouldn't they both be called the same thing?
Jason
08:57 PM
Fair question! It’s because the scraper is forked from Algolia’s docsearch project, and I made it work with Typesense. Algolia calls a collection of documents an “index”, whereas Typesense calls it a “collection”.
Jason
08:57 PM
I didn’t want to change those naming conventions, especially for folks switching from Algolia to Typesense.
James
09:12 PM
Jason
09:43 PM
Amazing, thank you! 🙏
Jason
09:46 PM
> (i.e. the first part of the public URL that end-users will connect to).
This part feels a little confusing as to whether that includes https:// or not…
James
09:46 PM
Sure, feel free to edit as needed.


Jason
10:04 PM
Pushed it out!