# community-help
j
It seems that every time I run the crawler, it deletes my old collection and creates a new one with a new arbitrary name, which breaks my production website (because the collection name changes). I read the entire documentation but it doesn't mention anything about managing collection names or somehow specifying the intended collection name to the scraper: https://typesense.org/docs/guide/docsearch.html#run-the-scraper Can someone link me to the correct documentation for this? I feel like I must be missing something basic.
j
I was just reading your other comment… Every time the scraper runs, it does the following:
1. Look at the `index_name` field in your docsearch-scraper config file (let’s say it’s defined as `index_name: docs`).
2. Create a new collection called `docs_<current_unix_timestamp>`.
3. Create/update an alias called `docs` to point to `docs_<current_unix_timestamp>`.
4. Delete the previously scraped version of the docs, stored in `docs_<previous_timestamp>`.

Think of the `docs` alias as a symlink that points to the latest scraped version of the docs. The scraper handles this update automatically. In the docsearch config on the frontend, you want to use `docs` as the index name, instead of the timestamped collection name.
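In Typesense API terms, one scraper run boils down to roughly this. It's just a sketch using the official `typesense` JS client; the collection names, host, and API key are placeholders, not anything from your setup:

```typescript
// A rough sketch of one scraper run against the Typesense API.
// Collection names, host, and API key are placeholders.
import Typesense from 'typesense';

const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
  apiKey: 'xyz',
});

async function simulateScraperRun(aliasName: string) {
  // Step 2: create a new timestamped collection (auto schema for brevity)
  const newCollection = `${aliasName}_${Math.floor(Date.now() / 1000)}`;
  await client.collections().create({
    name: newCollection,
    fields: [{ name: '.*', type: 'auto' }],
  });

  // ...the scraper indexes the freshly scraped documents into newCollection here...

  // Step 3: point the alias at the new collection (upsert creates it if it doesn't exist yet)
  const previous = await client.aliases(aliasName).retrieve().catch(() => null);
  await client.aliases().upsert(aliasName, { collection_name: newCollection });

  // Step 4: drop the previously scraped collection, if there was one
  if (previous && previous.collection_name !== newCollection) {
    await client.collections(previous.collection_name).delete();
  }
}

simulateScraperRun('docs');
```

The point is that the frontend always queries the `docs` alias, so it never needs to know which timestamped collection is currently behind it.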
~So in your case, it looks like you’ve defined indexName as `IsaacScript`.~ Actually, that’s not the scraper configuration; that’s the FE configuration. Could you share your docsearch-scraper configuration?
j
My docsearch-scraper configuration is directly copy-pasted from the docs:
```json
{
  "index_name": "docusaurus-2",
  "start_urls": [
    "<https://isaacscript.github.io/>"
  ],
  "sitemap_urls": [
    "<https://docusaurus.io/sitemap.xml>"
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [
    "/tests"
  ],
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 46250
}
```
This is what the docs directly tell you to use. I have made no other changes.
j
Notice this line: `"index_name": "docusaurus-2"`. That’s where the collection name comes from. You want to change it to `"index_name": "IsaacScript"` and rerun the scraper.
j
Oh, that's confusing. Why is the setting called `index_name` in some places and `typesenseCollectionName` in other places? Shouldn't they both be called the same thing?
j
Fair question! It’s because the scraper is forked from Algolia’s docsearch project, and I made it work with Typesense. And Algolia calls a collection of documents an “index”, whereas Typesense calls a collection of documents a collection
I didn’t want to change those naming conventions, especially for folks switching from Algolia to Typesense
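So the same value ends up under two different keys. As a rough side-by-side (assuming the Docusaurus theme plugin on the frontend; the hostname and API key are placeholders):

```typescript
// Same value, two names: the scraper config calls it "index_name",
// the frontend plugin calls it "typesenseCollectionName".
// Hostname and API key are placeholders.

// 1. docsearch-scraper config (the scraper side):
const scraperConfig = {
  index_name: 'IsaacScript',
  start_urls: ['https://isaacscript.github.io/'],
};

// 2. Frontend side, e.g. the `typesense` block in docusaurus.config.js
//    when using docusaurus-theme-search-typesense:
const themeConfig = {
  typesense: {
    typesenseCollectionName: 'IsaacScript', // the alias, not IsaacScript_<timestamp>
    typesenseServerConfig: {
      nodes: [{ host: 'xxx.a1.typesense.net', port: 443, protocol: 'https' }],
      apiKey: 'search-only-api-key',
    },
  },
};
```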
j
Amazing, thank you! 🙏
j
> (i.e. the first part of the public URL that end-users will connect to)

This part feels a little confusing as to whether that includes `https://` or not…
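For what it’s worth, if that sentence is about the Typesense node/host setting, my reading is that it should not include `https://`, since the scheme lives in a separate `protocol` field. A sketch of what I mean (hostname is a placeholder):

```typescript
// Assumption: the doc sentence refers to the `host` field of a Typesense node.
// `host` is just the hostname; the scheme goes in the separate `protocol` field.
const typesenseServerConfig = {
  nodes: [
    {
      host: 'xyz.a1.typesense.net', // not 'https://xyz.a1.typesense.net'
      port: 443,
      protocol: 'https',
    },
  ],
  apiKey: 'search-only-api-key',
};
```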
j
Sure, feel free to edit as needed.
👍 1
j
Pushed it out!