Comparing Performance and Security of Different Data Collection Methods

TLDR bnfd asked how performance and memory usage differ between a single large collection and multiple smaller collections. Kishore Nallan explained that multiple smaller collections are faster but use more RAM, suggested 100 collections sharded by `user_id % 100` as a middle ground, and explained how scoped API keys with embedded filters keep each user's data separate.

bnfd
Wed, 01 Sep 2021 12:42:38 UTC

Is there a difference in performance between case A: one collection of 1M documents, case B: 50 collections of 20-30K documents each?

Kishore Nallan
Wed, 01 Sep 2021 12:44:04 UTC

The latter should be much faster, but with higher memory overhead because the word/token index is not shared across collections. At the scale of 1M documents, though, it should not make too much of a difference. A good middle ground would be to use 100 collections, where you "shard" by `user_id % 100` to map each user to a collection name.

bnfd
Wed, 01 Sep 2021 12:47:48 UTC

By "higher memory overhead" you mean it will end up consuming more RAM?

Kishore Nallan
Wed, 01 Sep 2021 12:48:09 UTC

Correct.

Kishore Nallan
Wed, 01 Sep 2021 12:48:52 UTC

Exact difference will depend on shape of data.

bnfd
Wed, 01 Sep 2021 12:51:10 UTC

Would it be roughly 2x or more?

Kishore Nallan
Wed, 01 Sep 2021 12:52:49 UTC

Hard to say. It may not be that much, I think. It is very difficult to answer because it depends on, for example, how frequently words appear in the dataset and how they are distributed.

bnfd
Wed, 01 Sep 2021 12:52:50 UTC

I'm not sure what you mean by "shard to map to the collection name", is there anything on the docs regarding this?

Kishore Nallan
Wed, 01 Sep 2021 12:53:11 UTC

Are you going to have separate sets of users or customers?

bnfd
Wed, 01 Sep 2021 12:53:12 UTC

I understand, thanks!

bnfd
Wed, 01 Sep 2021 12:53:43 UTC

yes each collection belongs to different user and will only be accessed by that user

bnfd
Wed, 01 Sep 2021 12:54:29 UTC

but the shape of the data will be the same, same schema

Kishore Nallan
Wed, 01 Sep 2021 12:55:00 UTC

Assume that you have 100,000 customers, but you want to fit them into 100 collections, which is a middle ground between creating 1 collection per customer and having just 1 collection for all customers. So we will have `collection_0`, `collection_1`, etc. To find the collection for a given user, we calculate `user_id % 100`, which produces a number from 0 to 99 that maps to one of the 100 collections.
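The mapping described above can be sketched in a few lines of Python. This is only an illustration: the helper name `collection_for_user` and the constant `NUM_COLLECTIONS` are hypothetical, and it assumes `user_id` is a non-negative integer.

```python
NUM_COLLECTIONS = 100  # number of shards suggested in the thread


def collection_for_user(user_id: int, num_collections: int = NUM_COLLECTIONS) -> str:
    """Map a numeric user ID to one of `num_collections` collection names."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # The remainder is always in the range 0 .. num_collections - 1.
    return f"collection_{user_id % num_collections}"


print(collection_for_user(7))      # collection_7
print(collection_for_user(100))    # collection_0
print(collection_for_user(12345))  # collection_45
```

Every user always lands in the same collection, and roughly 1,000 of the 100,000 customers share each one, so per-user filtering is still needed inside a collection.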

bnfd
Wed, 01 Sep 2021 12:56:29 UTC

ah I see, but doesn't that mess with "scope"? user1 shouldn't be able to access user2 docs

Kishore Nallan
Wed, 01 Sep 2021 12:58:12 UTC

That will be the same issue when using a single collection, right?

Kishore Nallan
Wed, 01 Sep 2021 12:58:20 UTC

You can control that using scoped API keys.

bnfd
Wed, 01 Sep 2021 12:59:47 UTC

Aren't scoped keys for limiting search at the collection level?

Kishore Nallan
Wed, 01 Sep 2021 13:04:37 UTC

No you can embed filters into them. We have elaborated on that here:

bnfd
Wed, 01 Sep 2021 13:11:20 UTC

Oh, in that case why not use one collection for all users (with scoped key and filters)? I was going to use one collection for each user and scoped keys as "ACL"

Kishore Nallan
Wed, 01 Sep 2021 13:13:36 UTC

If you use one collection per user, you don't need a scoped API key. You can generate a normal API key per collection. Scoped API keys are meant for embedding filters that are baked into the key itself, so you can host multiple users within the same collection.
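To make "filters baked into the key" concrete, here is a minimal standard-library sketch of how a scoped key can be derived from a parent search-only key. It follows the HMAC-based scheme that the official client libraries are understood to use, but treat it as an assumption rather than a specification; in a real application, prefer the scoped-key helper that ships with your client library. The parent key value and the `user_id:42` filter are placeholders.

```python
import base64
import hashlib
import hmac
import json


def generate_scoped_search_key(search_key: str, parameters: dict) -> str:
    """Derive a scoped key by signing embedded search parameters with the parent key."""
    params_str = json.dumps(parameters)
    # Sign the embedded parameters so they cannot be altered without the parent key.
    digest = base64.b64encode(
        hmac.new(
            search_key.encode("utf-8"),
            params_str.encode("utf-8"),
            hashlib.sha256,
        ).digest()
    ).decode("utf-8")
    key_prefix = search_key[:4]  # lets the server look up the parent key
    raw_scoped_key = f"{digest}{key_prefix}{params_str}"
    return base64.b64encode(raw_scoped_key.encode("utf-8")).decode("utf-8")


# Hypothetical parent key: it should be a search-only key, never an admin key.
PARENT_SEARCH_KEY = "hypothetical-search-only-key"

# Every search made with this key is restricted to documents of user 42.
scoped_key = generate_scoped_search_key(PARENT_SEARCH_KEY, {"filter_by": "user_id:42"})
print(scoped_key)
```

The scoped key is generated on your backend and handed to the user's client. Because the filter is covered by the signature, the user cannot change it to read another user's documents.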

bnfd
Wed, 01 Sep 2021 13:16:43 UTC

I see

bnfd
Wed, 01 Sep 2021 13:18:50 UTC

if security is the first priority, which setup is optimal? Let's say for the 100,000 customers example

Kishore Nallan
Wed, 01 Sep 2021 13:21:06 UTC

Both scoped API keys and regular API keys are secure. Once a scoped API key is created with an embedded `user_id: X` filter, it is as secure as using 1 collection per user_id. However, on paper, 1 collection per customer gives a clear separation of data.

bnfd
Wed, 01 Sep 2021 13:25:06 UTC

thanks!