TLDR Vadali spoke with Jason about potentially sharding data across multiple nodes, seeking solutions for their large dataset. Jason recommended reducing the indexed fields and considering splitting future data across multiple collections.
If by sharding you mean, splitting one collection across multiple nodes - no we have no plans to do that in the near future. Today, in a multi-node cluster, all collections are replicated fully to all the nodes, for high availability.
Commercially available RAM today goes up to 24 TB on a single server, so our stance is that that provides enough RAM to scale up vertically for a vast number of datasets, before having to shard and scale horizontally.
Could you expand on your use-case for sharding?
we are using AWS EC2 96GB, (not going 24TB ram though) and we need to index 80million music and related data unable to fit, thinking if we can shard. Data will grow but won't reduce, looking various options we have.
Got it
Btw, you want to make sure you’re only adding fields you’ll be searching / faceting / filtering / grouping / sorting on to the schema, since only those will be indexed in memory. You can still send other field in your documents for display purposes, those will be stored on disk and returned when the document is a hit
Also quick design question, we are a multi tenant system, is it good to have one collection of million of records with filter on tenant or have 1000's of collections each with 1000-50000 records
For up to 10s of millions of records, putting all of them in one collection is fine. But depending on your query and filter complexity, you might eventually want to shard the data for different users in multiple collections, to maintain good performance. So for eg, you could shard by a simple `user_id modulo 10` to shard into 10 collections. Or you could create one collection per user. The tradeoff is that with multiple collections, if you need to change the collection schema, you would have to handle that programmatically across all collections…
Vadali
Tue, 31 Jan 2023 18:49:09 UTCis there any plans sharding in future.