# community-help
j
It seems that every time I run the crawler, it deletes my old collection and creates a new one with a new arbitrary name, which breaks my production website (because the collection name changes). I read the entire documentation but it doesn't mention anything about managing collection names or somehow specifying the intended collection name to the scraper: https://typesense.org/docs/guide/docsearch.html#run-the-scraper Can someone link me to the correct documentation for this? I feel like I must be missing something basic.
j
I was just reading your other comment… Every time the scraper runs, it does the following:
1. Look at the `index_name` field in your docsearch-scraper config file (let’s say it’s defined as `index_name: docs`).
2. Create a new collection called `docs_<current_unix_timestamp>`.
3. Create/update an alias called `docs` to point to `docs_<current_unix_timestamp>`.
4. Delete the previously scraped version of the docs, stored in `docs_<previous_timestamp>`.

Think of the `docs` alias as a symlink that points to the latest scraped version of the docs. The scraper handles this update automatically. In the docsearch config on the frontend, you want to use `docs` as the index name, instead of the timestamped collection name.
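In Typesense API terms, one scraper run boils down to roughly this. It's just a sketch using the official `typesense` JS client; the collection names, host, and API key are placeholders, not anything from your setup:

```typescript
// A rough sketch of one scraper run against the Typesense API.
// Collection names, host, and API key are placeholders.
import Typesense from 'typesense';

const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
  apiKey: 'xyz',
});

async function simulateScraperRun(aliasName: string) {
  // Step 2: create a new timestamped collection (auto schema for brevity)
  const newCollection = `${aliasName}_${Math.floor(Date.now() / 1000)}`;
  await client.collections().create({
    name: newCollection,
    fields: [{ name: '.*', type: 'auto' }],
  });

  // ...the scraper indexes the freshly scraped documents into newCollection here...

  // Step 3: point the alias at the new collection (upsert creates it if it doesn't exist yet)
  const previous = await client.aliases(aliasName).retrieve().catch(() => null);
  await client.aliases().upsert(aliasName, { collection_name: newCollection });

  // Step 4: drop the previously scraped collection, if there was one
  if (previous && previous.collection_name !== newCollection) {
    await client.collections(previous.collection_name).delete();
  }
}

simulateScraperRun('docs');
```

The point is that the frontend always queries the `docs` alias, so it never needs to know which timestamped collection is currently behind it.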
~So in your case, it looks like you’ve defined indexName as `IsaacScript`.~ Actually, that’s not the scraper configuration; that’s the FE configuration. Could you share your docsearch-scraper configuration?
j
My docsearch-scraper configuration is directly copy-pasted from the docs:
```json
{
  "index_name": "docusaurus-2",
  "start_urls": [
    "<https://isaacscript.github.io/>"
  ],
  "sitemap_urls": [
    "<https://docusaurus.io/sitemap.xml>"
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [
    "/tests"
  ],
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ]
  },
  "conversation_id": [
    "833762294"
  ],
  "nb_hits": 46250
}
```
This is what the docs directly tell you to use. I have made no other changes.
j
Notice this line: `"index_name": "docusaurus-2"`. That’s where the collection name comes from. You want to change it to `"index_name": "IsaacScript"` and rerun the scraper.
j
Oh, that's confusing. Why is the setting called `index_name` in some places and `typesenseCollectionName` in other places? Shouldn't they both be called the same thing?
j
Fair question! It’s because the scraper is forked from Algolia’s docsearch project, and I made it work with Typesense. And Algolia calls a collection of documents an “index”, whereas Typesense calls a collection of documents a collection
I didn’t want to change those naming conventions, especially for folks switching from Algolia to Typesense
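So the same value ends up under two different keys. As a rough side-by-side (assuming the Docusaurus theme plugin on the frontend; the hostname and API key are placeholders):

```typescript
// Same value, two names: the scraper config calls it "index_name",
// the frontend plugin calls it "typesenseCollectionName".
// Hostname and API key are placeholders.

// 1. docsearch-scraper config (the scraper side):
const scraperConfig = {
  index_name: 'IsaacScript',
  start_urls: ['https://isaacscript.github.io/'],
};

// 2. Frontend side, e.g. the `typesense` block in docusaurus.config.js
//    when using docusaurus-theme-search-typesense:
const themeConfig = {
  typesense: {
    typesenseCollectionName: 'IsaacScript', // the alias, not IsaacScript_<timestamp>
    typesenseServerConfig: {
      nodes: [{ host: 'xxx.a1.typesense.net', port: 443, protocol: 'https' }],
      apiKey: 'search-only-api-key',
    },
  },
};
```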
j
Amazing, thank you! 🙏
j
> (i.e. the first part of the public URL that end-users will connect to)

This part feels a little confusing as to whether that includes `https://` or not…
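For what it’s worth, if that sentence is about the Typesense node/host setting, my reading is that it should not include `https://`, since the scheme lives in a separate `protocol` field. A sketch of what I mean (hostname is a placeholder):

```typescript
// Assumption: the doc sentence refers to the `host` field of a Typesense node.
// `host` is just the hostname; the scheme goes in the separate `protocol` field.
const typesenseServerConfig = {
  nodes: [
    {
      host: 'xyz.a1.typesense.net', // not 'https://xyz.a1.typesense.net'
      port: 443,
      protocol: 'https',
    },
  ],
  apiKey: 'search-only-api-key',
};
```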
j
Sure, feel free to edit as needed.
👍 1
j
Pushed it out!