#community-help

Handling Kinesis Stream Event Batching with Typesense

TLDR Dui asked how to handle Kinesis stream events with Typesense. Kishore Nallan suggested using upsert mode for creates and updates and handling deletes through logical deletion. After further discussion, including identifying and fixing a bug, they agreed that an emplace action would be introduced in Typesense v0.23.

Nov 25, 2021 (22 months ago)
Dui
09:56 AM
Hi everyone!

So I have a Kinesis stream with a really big amount of indexes that should be created/updated/deleted.

How would I go about batching events so that I can be agnostic about the action? From the documentation, I've seen that one can batch an array of objects with one action:
client.collections('companies').documents().import(documents, {action: 'create'})

My main question is, would it be possible to make a batch with different actions or will I be forced to make different batches depending on the actions of the data that I receive from the kinesis stream?

For example, Algolia has this:
client.multipleBatch([
  { action: 'addObject', indexName: 'index1', body: { firstname: 'Jimmie', lastname: 'Barninger' } },
  { action: 'deleteObject', indexName: 'index2', body: { objectID: 'myID5' } }
])
Kishore Nallan
09:57 AM
👋 Dui We don't have a universal actions end-point. The import end-point has an upsert mode that you can use for creation/update.

Deletes have to be sent separately. But we don't have a batched delete end-point for that.

We're aware of this limitation and hope to address it in the future.
Kishore Nallan
09:58 AM
One way to handle deletes for now is to treat it as an update (logical delete) and then perform delete by query which Typesense does support.
Kishore Nallan
10:00 AM
Delete by query: https://typesense.org/docs/0.21.0/api/documents.html#delete-by-query

You can set an is_deleted: true flag for logical deletion and then periodically do batch deletes using the delete by query end-point. Your searches also need to filter on is_deleted: false so that these objects are not returned in search results.
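
The logical-delete approach described above could look roughly like the following. This is a minimal sketch under assumptions, not an official recipe: `client` is taken to be an already-configured Typesense client, and the helper names (`withoutDeleted`, `searchLive`, `purgeDeleted`) are hypothetical. It also assumes `is_deleted` is a bool field that is set to false on every live document, so that filtering on it matches them.

```javascript
// Sketch only. `client` is assumed to be an already-configured Typesense client,
// and `is_deleted` is assumed to be a bool field present on every document
// (false on live documents), so that filtering on it is meaningful.

// Append the logical-deletion filter to existing search parameters.
// Assumes any existing filter_by only combines conditions with &&.
function withoutDeleted(searchParams) {
  const filterBy = searchParams.filter_by
    ? `${searchParams.filter_by} && is_deleted:false`
    : 'is_deleted:false';
  return { ...searchParams, filter_by: filterBy };
}

// Search that never returns logically deleted documents.
function searchLive(client, collection, searchParams) {
  return client.collections(collection).documents().search(withoutDeleted(searchParams));
}

// Periodic clean-up: physically remove everything flagged as deleted
// using the delete by query end-point.
function purgeDeleted(client, collection) {
  return client.collections(collection).documents().delete({ filter_by: 'is_deleted:true' });
}
```

`purgeDeleted` would typically run on a schedule (for example a nightly job), while `searchLive` wraps every user-facing search.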
Kishore Nallan
10:04 AM
Let me know if that makes sense Dui
Dui
10:18 AM
Thanks for such a quick answer!

Okay, so I'm gonna have to make a few different batches depending on the action then?
You wrote: "We're aware of this limitation and hope to address it in the future." Do you mean a universal actions end-point or a batched delete end-point?
Kishore Nallan
10:19 AM
We want to explore a universal action end-point, but in the meantime, instead of different batches, logical deletion is the easiest approach.

You can have a single Kinesis consumer that can handle insert, update and delete.

Do you see any problem with using logical deletes?
Dui
10:24 AM
Nice!

So you would recommend that my kinesis consumer would differentiate between the action and run an import(action) for each?

Logical deletion seems fine as for now, I guess the only downside is that I'd have to massage the data a little bit to be able to know which ones to delete, but it's basically how we are working today anyway 🙂
Kishore Nallan
10:28 AM
You can just use action=upsert regardless of whether the record is an insert or update or delete and send them all in the same batch, in a single API call.

The only additional processing you need to do is for the deletion case. You can send a simple {"id": "<id>", "is_deleted": true} document or modify the actual document (if that's available to the consumer) to set the is_deleted field.
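
A consumer following this suggestion might map each stream event to a document and send everything as one upsert batch. The sketch below is illustrative only: the event shape (`action`, `id`, `document`) and the function name `toUpsertDocs` are assumptions, not something Kinesis or Typesense prescribes. It also assumes that create/update events carry the whole document (upsert replaces the stored document) and that the other schema fields are optional, so a bare `{ id, is_deleted: true }` tombstone is accepted.

```javascript
// Sketch only: the event shape { action, id, document } is hypothetical.
// Turns a mixed list of create/update/delete events into documents that can
// all be sent with a single import call using action: 'upsert'.
function toUpsertDocs(events) {
  return events.map((event) => {
    if (event.action === 'delete') {
      // Logical delete: keep the id, flag the document as deleted.
      return { id: event.id, is_deleted: true };
    }
    // Create and update look the same to upsert: send the whole document.
    return { ...event.document, id: event.id, is_deleted: false };
  });
}

// With a configured client, the batch would then be sent in one call, e.g.:
//   client.collections('companies').documents()
//     .import(toUpsertDocs(events), { action: 'upsert' })
```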
Dui
10:38 AM
Ah, of course! Thanks 🙂
Dui
03:03 PM
Kishore Nallan Sorry to bother you again, but I have a weird issue.

When I run an import with two batched updates (both are targeting the same document, where one is upserting, and one is updating) there are fields that are missing in the result.

So one update runs first, it updates 3 out of 20 fields.
Secondly, an upsert is run, which is supposed to update 17 out of 20 fields.

The result is that I only get 17 out of 20 fields defined. It's like the update gets overwritten by the upsert's missing fields.
Kishore Nallan
03:04 PM
Do they run parallel? And what version of Typesense are you using?
Kishore Nallan
03:04 PM
I mean, are these 2 separate import calls?
Dui
03:07 PM
I'm using v1.0.2 and yes, they are two separate import calls. I didn't think I could batch two different actions in one import call?
Kishore Nallan
03:07 PM
You mean 0.21? The Typesense server version, not the client.
Dui
03:08 PM
ah, right: v0.21.0
Kishore Nallan
03:08 PM
In v0.21, the import jobs were multiplexed because we didn't think that people might run parallel updates to the same collection. In v0.22 we have fixed that so that parallel writes to a given collection are always serialized.
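
On a server where parallel writes are not serialized, one possible workaround is to serialize import calls on the client side. The following is a minimal sketch of that idea, not something suggested in this thread: `enqueueImport` and `chains` are hypothetical names, and the queue only covers calls made from a single process.

```javascript
// Sketch only: one promise chain per collection, so each import call starts
// only after the previous call for the same collection has settled.
const chains = new Map();

function enqueueImport(collection, runImport) {
  const previous = chains.get(collection) || Promise.resolve();
  // Swallow the previous failure here so that one failed import does not
  // block the ones queued after it; the original caller still sees the error.
  const next = previous.catch(() => {}).then(runImport);
  chains.set(collection, next);
  return next;
}

// Hypothetical usage with a configured client:
//   enqueueImport('test_index', () =>
//     client.collections('test_index').documents().import(docs, { action: 'update' }))
```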
Dui
03:09 PM
Ok, so my problem will persist until I update to the upcoming v0.22?
Kishore Nallan
03:09 PM
> I didn't think I could batch two different actions in one import call?
That's correct, unless you always had the whole document, in which case, it's just upsert.
Kishore Nallan
03:10 PM
0.22 is pretty close to being released and we already have several customers using it on production. Just closing out some last mile edge cases. It's much better than 0.21 and we have addressed a lot of things.
Kishore Nallan
03:11 PM
If you can tell me how you are running the Typesense server (Mac, Linux, Docker etc.) I can point you to the correct build to try it out.
Dui
03:13 PM
Oh I'm using the cloud version
Kishore Nallan
03:13 PM
No problem, we can upgrade your Cloud version to the latest stable 0.22 RC version.
Dui
03:13 PM
Awesome. Are there docs for it?
Kishore Nallan
03:14 PM
Give us a few hours, and we will update this thread once that's done.
Dui
03:14 PM
Sweet! Thx
Dui
03:46 PM
> No problem, we can upgrade your Cloud version to the latest stable 0.22 RC version.
Do you need anything from me?
Kishore Nallan
03:47 PM
Yes, your cluster ID. You can email us [email protected]
Dui
04:26 PM
Hey btw! Will nested objects be available from 0.22 as well?
Kishore Nallan
04:41 PM
No, we started work on that but had to defer since the changes were more involved.
Nov 26, 2021 (22 months ago)
Kishore Nallan
02:34 AM
The cluster has been upgraded.
Dui
09:26 AM
Thanks a bunch! One weird thing that remains.

I make two updates in one import:
const body = [
  { id: 'test_id', lastChance: false },
  {
     id: 'test_id',
     'price_SE.amount': 5000,
     'price_SE.currency': 'SEK'
  }
]

typesenseClient
  .collections('test_index')
  .documents()
  .import(body, { action: 'update', batch_size: 100 })

When I look in the data in typesense, it seems like the lastChance-field from the first object is undefined. And if I reverse the order of these two, the price_SE.amount gets undefined instead.

Do you know what could be the case? It's like it is updating the fields that are missing from each of the objects with undefined instead of ignoring them.
Kishore Nallan
09:28 AM
What was the original field value of the test_id document before the update was called?
Kishore Nallan
09:46 AM
Also Dui are these 3 fields optional in the schema?
Kishore Nallan
09:54 AM
I've put together a quick example here: https://gist.github.com/kishorenc/fe52722a579a86d9fbb49c789f783d32

The update works fine for me. If you can provide a similar reproducible example, I will be happy to take a look!
Dui
10:39 AM
Ok, thanks! Probably missing something with the initial value and that it is optional
Kishore Nallan
10:40 AM
> When I look in the data in typesense, it seems like the lastChance-field from the first object is undefined.
Does the field not exist in the returned document, or is the value of the field undefined? I won't rule out a bug lurking here, but I need more information to ascertain that.
Dui
10:42 AM
it doesn't exist in the first object, that's why I thought it would be ignored
Kishore Nallan
10:43 AM
You first inserted a document with no lastChance field, then ran the import above and now when you query the document the lastChance is not found? Sorry for being a bit slow here 🙂
Dui
10:55 AM
yeah so I ran an upsert import with no lastChance. After that, I ran two update imports where one of them contains lastChance. Then the lastChance is not found.
Kishore Nallan
10:57 AM
That shouldn't happen. Can you please share a code snippet (any client is fine) which reproduces the issue in a small stand-alone example that I can run?
Kishore Nallan
11:00 AM
Okay, I am able to reproduce it. Let me get back to you on what's going wrong here.
Dui
11:00 AM
this is my code:
const batchImport = async ({ index, action, body }) => {
  if (!body) return
  // Return the promise so that `await batchImport(...)` actually waits for the import
  return typesenseClient
    .collections(index)
    .documents()
    .import(body, { action, batch_size: 100 })
    .then((res) => console.log(JSON.stringify(res)))
    .catch((e) => console.error(JSON.stringify(e)))
}

await batchImport({ index, action: 'upsert', body: groupedData['upsert'] })
await batchImport({ index, action: 'update', body: groupedData['update'] })

this is the data:
groupedData {
  upsert: [
    {
      id: 'CYgQsv4oZi',
      'metadata.material': [Array],
      'metadata.brand': 'Nike',
      'categories.lvl0': [Array],
      'categories.lvl1': [Array],
      'categories.lvl2': [Array],
      'categories.lvl3': [Array]
    }
  ],
  update: [
    { id: 'CYgQsv4oZi', lastChance: false },
    {
      id: 'CYgQsv4oZi',
      'price_SE.amount': 5000,
      'price_SE.currency': 'SEK'
    }
  ]
}

Hope you can make that out 🙂 !

the reason for the batch is because when this works properly, I'm gonna start importing them by the thousands 😛
Kishore Nallan
12:29 PM
Found the issue. I'm working on a fix, and will be able to patch your cluster once I have it ready.
Dui
12:30 PM
Awesome! Bug?
Kishore Nallan
12:30 PM
Yup, in the logic that reconciles fields of repeating document IDs within the same batch.
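
Until a fixed build is in place, a client could sidestep repeated IDs altogether by merging entries that share an id before sending an update batch. This is a minimal sketch of that client-side workaround, not the server-side fix itself; `mergeById` is a hypothetical helper, and it assumes that later entries should win when the same field appears twice.

```javascript
// Sketch only: collapse entries with the same id into one partial document,
// so the batch never contains a repeated id. Later entries override earlier ones.
function mergeById(docs) {
  const merged = new Map();
  for (const doc of docs) {
    merged.set(doc.id, { ...(merged.get(doc.id) || {}), ...doc });
  }
  return [...merged.values()];
}

const updates = [
  { id: 'CYgQsv4oZi', lastChance: false },
  { id: 'CYgQsv4oZi', 'price_SE.amount': 5000, 'price_SE.currency': 'SEK' },
];

// A single document carrying lastChance and both price_SE fields.
const mergedUpdates = mergeById(updates);
```

The merged array can then be passed to the same import call with action: 'update'.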
Dui
12:31 PM
Okay, glad we found it! Will it be available in the v0.22 or how is your process around fixes of that sort?
Kishore Nallan
12:33 PM
Yes, it will be available as part of the 0.22 release, which is not GA yet. Our release cycles are generally longish (a few months) because we want to be absolutely sure that the GA release is stable.

We balance that by making release candidate builds available to customers that we work closely with on new features, so that they help in both validating the feature and in overall maturity (since they will also be progressing from a dev -> prod for a given new feature that we add).
Dui
12:34 PM
Sounds great!
Kishore Nallan
02:19 PM
I've fixed it locally, along with a test. Thanks for catching this! We will be testing this build further and once it's ready, I will update your cluster with this version and post on this thread.
Dui
04:12 PM
Awesome, thanks!
Nov 27, 2021 (22 months ago)
Kishore Nallan
08:08 AM
We've upgraded your cluster. Please try it out and let me know Dui.
Nov 29, 2021 (22 months ago)
Dui
01:27 PM
Thank you! It's working now 🙂

One question that is related though:
If I run three upserts in sequence that target the same document (same id), it doesn't get updated for each upsert. Any ideas?
Kishore Nallan
01:29 PM
Upserts of an ID that already exists in the collection, or an upsert where the ID is inserted for the first time with 3 entries in the upsert batch?
Dui
01:42 PM
basically this payload:

[
  {
    "id": "CYgQsv4oZi",
    "createdAt": 1637160199,
    "updatedAt": 1637576744,
    "metadata.brand": "Nike"
  },
  {
    "id": "CYgQsv4oZi",
    "lastChance": false
  },
  {
    "id": "CYgQsv4oZi",
    "pricing.amount": 50,
    "pricing.currency": "SEK"
  }
]

The end-result is that only the last object in the array gets written to the document.
Kishore Nallan
01:43 PM
Does CYgQsv4oZi already exist in the collection before this payload is sent?
Dui
01:44 PM
No, it does not
Dui
01:48 PM
But when I try to run an upsert on that document with additional info, it writes the new data and removes the previously added data within the document.
Kishore Nallan
01:48 PM
Upsert replaces the whole document. If you want to do partial updates, you have to use the update action.
Kishore Nallan
01:49 PM
insert: document does not exist and you want to insert whole document.

upsert: document might or might not exist and you want to insert/replace whole document.

update: document certainly exists and you want to update part or all of it.
Kishore Nallan
01:51 PM
So, if you want to make the changes from all 3 entries in the batch to be reflected, use action=update.
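
The difference between the actions can be illustrated with a toy in-memory model. This is only a sketch of the semantics described above, not Typesense's implementation: `applyBatch`, the `Map`-based store, and the error strings are made up for illustration, and 'create' stands in for what the message above calls insert.

```javascript
// Toy in-memory model of the import actions, for illustration only.
// Returns one result per document, loosely mirroring per-document import results.
function applyBatch(store, docs, action) {
  return docs.map((doc) => {
    const existing = store.get(doc.id);
    if (action === 'create') {
      if (existing) return { success: false, error: 'document already exists' };
      store.set(doc.id, { ...doc });
    } else if (action === 'upsert') {
      // Replaces the whole document, dropping fields not in `doc`.
      store.set(doc.id, { ...doc });
    } else if (action === 'update') {
      if (!existing) return { success: false, error: 'document not found' };
      // Merges the given fields into the existing document.
      store.set(doc.id, { ...existing, ...doc });
    } else {
      throw new Error(`unknown action: ${action}`);
    }
    return { success: true };
  });
}

const payload = [
  { id: 'CYgQsv4oZi', 'metadata.brand': 'Nike' },
  { id: 'CYgQsv4oZi', lastChance: false },
  { id: 'CYgQsv4oZi', 'pricing.amount': 50, 'pricing.currency': 'SEK' },
];

// All three as upsert: only the last entry survives.
const upserted = new Map();
applyBatch(upserted, payload, 'upsert');

// First entry as upsert, the rest as update: all fields are kept.
const updated = new Map();
applyBatch(updated, payload.slice(0, 1), 'upsert');
applyBatch(updated, payload.slice(1), 'update');
```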
Dui
01:51 PM
Ah! Okay, I thought upsert was an insert OR update depending on whether the document existed, like Algolia's partialUpdate with createIfNotExists. Does something like that exist?
Kishore Nallan
01:52 PM
No, we don't have an upsert+update behavior. This is how it worked in 0.20, but we had to change it because that's not what the word upsert commonly means to people familiar with other DBs.
Dui
01:53 PM
Understandable. Will there be a fourth option eventually, you think?
Kishore Nallan
01:54 PM
Yes, I was thinking that maybe an emplace action could be introduced that does an upsert if the document does not exist or an update if it already exists, and to which you can pass either a whole or a partial document.
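
The proposed semantics can be sketched with a small in-memory model. This is an illustration of the idea as described in the message above, not Typesense's implementation; the `emplace` function and the `Map`-based store are hypothetical.

```javascript
// Toy model of the proposed emplace semantics, for illustration only:
// create the document if it is missing, otherwise merge the given fields into it.
function emplace(store, doc) {
  const existing = store.get(doc.id);
  const next = existing ? { ...existing, ...doc } : { ...doc };
  store.set(doc.id, next);
  return next;
}

const emplaceStore = new Map();
emplace(emplaceStore, { id: 'CYgQsv4oZi', 'metadata.brand': 'Nike' }); // creates
emplace(emplaceStore, { id: 'CYgQsv4oZi', lastChance: false });        // merges
emplace(emplaceStore, { id: 'CYgQsv4oZi', 'pricing.amount': 50 });     // merges

// Against a server build that ships the feature, the request would presumably be
// the usual import call with the new action, e.g.:
//   client.collections(name).documents().import(docs, { action: 'emplace' })
```

With this behavior the consumer no longer has to know whether a document already exists, or split its batches into upserts and updates.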
Dui
01:56 PM
Nice! Do you know (approximately) in which release?
Kishore Nallan
01:57 PM
Can you please create an issue on GitHub for that for tracking? It will have to be prioritized for 0.23 (0.22 will be out soon, but it's under code freeze and we are only fixing bugs that show up in the final stretch).
Dui
01:57 PM
Sure, thanks a bunch!
Kishore Nallan
01:57 PM
But we can make a 0.23 RC build available as soon as we have the feature ready.
Kishore Nallan
02:04 PM
Thank you, I will keep you posted!
Dec 07, 2021 (21 months ago)
Kishore Nallan
12:36 PM
Dui How's your exploration going? Any other feedback?
Dui
01:05 PM
Hey! It's going well, thanks for asking!

I'm discussing with my team so we will reach out soon 🙂
Dec 08, 2021 (21 months ago)
Kishore Nallan
07:24 AM
Glad to hear! We just released v0.22 GA and the bug you reported earlier is part of that. Thank you!
Dec 13, 2021 (21 months ago)
Dui
02:21 PM
Glad to hear that 🙂
Dui
02:22 PM
Do you have any update on when this issue might be fixed? It's sort of a show-stopper for us, we've realized...
https://github.com/typesense/typesense/issues/447
Kishore Nallan
02:27 PM
Would Jan first week be something that would work for you?
Kishore Nallan
02:28 PM
If that's too far out, let me know and we will re-assess.
Dui
02:59 PM
No that is perfect!
Kishore Nallan
03:00 PM
Thank you, I will update this thread when it's ready to preview.
Dui
03:02 PM
Sounds great! Another one we've been discussing is nested objects (https://github.com/typesense/typesense/issues/227). Do you know which release it will be in?
Kishore Nallan
03:02 PM
That's a much more intricate change. We will probably be able to work on it only in late Jan or early Feb.
Kishore Nallan
03:03 PM
But it's a very frequently requested feature, so it's certainly landing.
Dui
03:11 PM
Awesome 🙂 But great to hear that the emplace will happen so soon!

I'll reach out soon and see if we can book a meeting with some more people from my team - we are still investigating whether we should move from Algolia.
Kishore Nallan
03:12 PM
👍 Will be happy to offer balanced inputs.
Dec 31, 2021 (21 months ago)
Kishore Nallan
02:26 AM
👋 Dui The emplace action has been implemented. Do you want to run an initial test using a Docker build before we upgrade your Cloud cluster?
Jan 03, 2022 (20 months ago)
Dui
12:19 PM
Wow awesome!!
Dui
12:20 PM
I think I'll create a new account from my company instead, so perhaps I can ask you to upgrade that one when it's set up?
Kishore Nallan
12:20 PM
Yes we can do that.
Dui
12:20 PM
Awesome, I'll reach out to you soon then 🙂