Discussing Potential Sharding of Data across Nodes

TLDR Vadali spoke with Jason about potentially sharding data across multiple nodes, seeking solutions for their large dataset. Jason recommended reducing the indexed fields and considering splitting future data across multiple collections.

Vadali
Tue, 31 Jan 2023 18:49:09 UTC

are there any plans for sharding in the future?

Jason
Tue, 31 Jan 2023 19:34:36 UTC

If by sharding you mean splitting one collection across multiple nodes - no, we have no plans to do that in the near future. Today, in a multi-node cluster, all collections are fully replicated to all the nodes, for high availability.

Jason
Tue, 31 Jan 2023 19:35:14 UTC

Commercially available RAM today goes up to 24 TB on a single server, so our stance is that this provides enough RAM to scale up vertically for a vast number of datasets, before having to shard and scale horizontally.

Jason
Tue, 31 Jan 2023 19:36:10 UTC

Could you expand on your use-case for sharding?

Vadali
Tue, 31 Jan 2023 19:53:33 UTC

we are using AWS EC2 with 96 GB RAM (not going to 24 TB of RAM though) and we need to index 80 million music records and related data, which is unable to fit, so we're thinking about whether we can shard. The data will grow but won't shrink, and we're looking at the various options we have.

Jason
Wed, 01 Feb 2023 01:04:03 UTC

Got it

Jason
Wed, 01 Feb 2023 01:04:49 UTC

Btw, you want to make sure you're only adding fields you'll be searching / faceting / filtering / grouping / sorting on to the schema, since only those will be indexed in memory. You can still send other fields in your documents for display purposes; those will be stored on disk and returned when the document is a hit.
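To illustrate Jason's point, here is a minimal sketch of a schema that declares only the queried fields; all field names and the collection name are hypothetical, not from the thread:

```python
# Hypothetical schema sketch: only fields used for searching, filtering,
# faceting, grouping, or sorting are declared, so only they are indexed
# in memory.
schema = {
    "name": "music",
    "fields": [
        {"name": "title", "type": "string"},                  # searched
        {"name": "artist", "type": "string", "facet": True},  # faceted
        {"name": "year", "type": "int32"},                    # filtered / sorted
    ],
}

# Display-only data (e.g. an album cover URL) is still sent in each document;
# it is stored on disk and returned with hits, but never indexed in memory.
document = {
    "title": "Example Song",
    "artist": "Example Artist",
    "year": 1999,
    "cover_url": "https://example.com/cover.jpg",  # not in schema: stored only
}
```

Trimming the schema this way reduces the in-memory index footprint without losing any of the document data returned to clients.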

Vadali
Tue, 21 Feb 2023 16:29:52 UTC

Also, a quick design question: we are a multi-tenant system. Is it better to have one collection of millions of records with a filter on tenant, or thousands of collections, each with 1,000-50,000 records?

Jason
Tue, 21 Feb 2023 16:56:03 UTC

For up to 10s of millions of records, putting all of them in one collection is fine. But depending on your query and filter complexity, you might eventually want to shard the data for different users across multiple collections, to maintain good performance. So for example, you could shard by a simple `user_id modulo 10` to shard into 10 collections. Or you could create one collection per user. The tradeoff is that with multiple collections, if you need to change the collection schema, you would have to handle that programmatically across all collections…
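The `user_id modulo 10` scheme above can be sketched as a simple routing function; the collection naming convention here is a hypothetical example:

```python
# Minimal sketch of modulo-based tenant sharding: each tenant's documents
# are routed to one of a fixed number of collections.
NUM_SHARDS = 10

def collection_for(user_id: int) -> str:
    """Return the name of the collection that holds this tenant's records."""
    return f"records_shard_{user_id % NUM_SHARDS}"
```

For example, `collection_for(23)` routes tenant 23 to `records_shard_3`. The tradeoff Jason mentions applies here: any schema change must then be applied programmatically to all 10 collections, since they are independent.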