TLDR bnfd asked about differences in performance and memory usage between a single, large data collection and multiple smaller collections. Kishore Nallan explained that the latter is faster, suggested 100 collections sharded by user_id, and explained the use and security of scoped API keys.
The latter should be much faster, with higher memory overhead, because the word/token index is not shared. But at the scale of 1M documents it should not make too much of a difference. A good middle ground is to use 100 collections, where you "shard" by `user_id % 100` to map each user to a collection name.
By "higher memory overhead" you mean it will end up consuming more RAM?
Correct.
Exact difference will depend on shape of data.
Would it be roughly 2x or more?
Hard to say. May not be that much, I think. It's difficult to answer precisely because it depends on, e.g., how frequently words appear in the dataset and how they are distributed.
I'm not sure what you mean by "shard to map to the collection name", is there anything on the docs regarding this?
Are you going to have separate sets of users or customers?
I understand, thanks!
Yes, each collection belongs to a different user and will only be accessed by that user
but the shape of the data will be the same, same schema
Assume that you have 100,000 customers, but you want to fit them into 100 collections -- a middle ground between creating 1 collection per customer and having just 1 collection for all customers. So we will have `collection_0`, `collection_1`, etc. To find the collection for a given user, compute `user_id % 100` -- this produces a number between 0 and 99 that maps to one of the 100 collections.
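The mapping above can be sketched in a few lines of Python (the collection-name prefix and shard count are just the example values from this thread; string user IDs would need a stable hash first):

```python
# Modulo-based shard mapping: 100,000 users -> 100 collections.
NUM_SHARDS = 100

def collection_for(user_id: int) -> str:
    """Map a numeric user ID to one of 100 collection names.

    E.g. user 4217 -> 'collection_17', since 4217 % 100 == 17.
    """
    return f"collection_{user_id % NUM_SHARDS}"
```

Every read and write for a user then goes to `collection_for(user_id)`, so the routing stays deterministic with no lookup table.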
ah I see, but doesn't that mess with "scope"? user1 shouldn't be able to access user2 docs
That would be the same issue when using a single collection, right?
You can control that using scoped API keys.
Aren't scoped keys for limiting search at the collection level?
No, you can embed filters into them. We have elaborated on that here:
Oh, in that case why not use one collection for all users (with scoped key and filters)? I was going to use one collection for each user and scoped keys as "ACL"
If you use one collection per user, you don't need scoped API key. You can generate normal API key per collection. Scoped API keys are meant to be used for embedding filters that are baked into the key itself so you can host multiple users within the same collection.
I see
if security is the first priority, which setup is optimal? Let's say for the 100,000 customers example
Both scoped API keys and regular API keys are secure. Once a scoped API key is created with an embedded `user_id: X` filter, it is as secure as using 1 collection per user_id. However, on paper, 1 collection per customer gives a clearer separation of data.
thanks!
bnfd
Wed, 01 Sep 2021 12:42:38 UTC
Is there a difference in performance between case A: one collection of 1M documents, and case B: 50 collections of 20-30K documents each?