#community-help

Optimizing Document Re-ingestion in Typesense

TLDR: Viktor and Elyes discuss ways to handle frequent document updates in Typesense. Kishore Nallan recommends the update/upsert import mode, sharding data across collections, and the emplace action for efficient re-ingestion.

Solved
Nov 15, 2022 (10 months ago)
Viktor
11:29 AM
How to best handle re-ingestion of a large set of documents that are frequently updated?

Think of a CRM with content that updates often (including deletions). Our current thinking is to have a stateless process that runs according to the following pseudo-code:
function reindex(docs: Doc[]) {
  const now = Date.now()

  // Tag every document with the timestamp of this reindex run
  const docsWithUpsertedAt = docs.map(doc => ({ ...doc, addedToTypesenseAt: now }))

  // Upsert docsWithUpsertedAt

  // Delete docs with filter_by=addedToTypesenseAt < $now
}

We had some concerns regarding the load on the Typesense service, especially as our documents would number in the tens of thousands. Mainly that there might be some inconsistency due to the asynchronous behaviour of upserting. What do you think about this? Are there alternative approaches worth considering?
cc Elyes
Kishore Nallan
11:50 AM
With this approach, between the time the // Upsert docsWithUpsertedAt operation begins and the // Delete docs operation ends, some of the documents will be duplicated.
Kishore Nallan
11:52 AM
Typesense already supports an update or upsert mode during import, which ensures that only the parts of the document that have changed are updated. That reduces some of the indexing load (though the field-wise comparison between the old and new document does still happen).
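A minimal sketch of such an import with the typesense-js client, reusing the docsWithUpsertedAt array from the pseudo-code above; the collection name crm_docs and the client config are placeholders:

import Typesense from 'typesense'

const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
  apiKey: 'xyz',
})

// action: 'upsert' creates missing documents and fully replaces existing ones;
// action: 'update' only patches the fields present in each imported document.
await client.collections('crm_docs').documents().import(docsWithUpsertedAt, { action: 'upsert' })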
Kishore Nallan
11:53 AM
Apart from import itself, I would recommend using multiple collections (about 5-10 depending on the size of the overall data) to shard out your customer data across collections. This will keep individual collections smaller and will help with all operations.
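A minimal sketch of one way to do that routing, assuming a hypothetical shardCollectionFor helper and shard collections named crm_docs_0 through crm_docs_4:

const NUM_SHARDS = 5

// Deterministically map a customer to one of the shard collections,
// so every document for that customer always lands in the same collection.
function shardCollectionFor(customerId: string): string {
  let hash = 0
  for (const ch of customerId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0
  }
  return `crm_docs_${hash % NUM_SHARDS}`
}

// e.g. client.collections(shardCollectionFor(customerId)).documents().import(docs, { action: 'upsert' })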
Elyes
12:40 PM
Thank you for the response! What would happen between the two operations if we used the emplace action mode instead (and we supplied a stable id field in each document)?
Viktor
12:51 PM
Kishore Nallan what are the performance implications of large collections? E.g., how much slower does a search get if a collection is doubled in size?
Kishore Nallan
01:10 PM
If you use emplace you are basically replacing the existing document, so you don't need a separate delete by filter.

Performance of large collections really depends on the shape of the data, whether the data is skewed or evenly distributed, etc. Up to 20M records, a single collection should be sufficient.
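A minimal sketch of an emplace import with stable ids, assuming a client configured as in the earlier sketch and hypothetical crmRecords with crmId, title, and body fields:

// Derive a stable id from the source system so re-imports target the same documents.
const docs = crmRecords.map(record => ({
  id: record.crmId,
  title: record.title,
  body: record.body,
}))

// action: 'emplace' creates the document if the id is new and updates it in place otherwise.
await client.collections('crm_docs').documents().import(docs, { action: 'emplace' })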