#community-help

Discussions on Typesense, Collections, and Dynamic Fields

TLDR Tugay shares plans to use Typesense for their SaaS platform and asks about collection sizes and sharding. Jason clarifies Typesense's capabilities and shares a beta feature. They discuss using unique collections per customer and new improvements. Kishore Nallan and Gabe comment on threading and data protection respectively.

Mar 03, 2021 (33 months ago)
Tugay
06:57 AM
👋 Hi everyone!
Jason
06:58 AM
Hi Tugay! Welcome!
Tugay
06:59 AM
Hi Jason, it is great to see that you are online. I have a couple of questions if you have time 😄
Jason
07:00 AM
I'll be around for about 15-20 minutes! Happy to answer questions
Tugay
07:02 AM
We are planning to use Typesense for our multi-tenant SaaS platform, which is an e-commerce platform like Shopify. We designed our system with one collection per customer, but is there any limit on the number of collections? Every collection will have 5k documents on average. In the long run there may be 10k-20k collections within a cluster.
Tugay
07:04 AM
Another question: are you planning to add a sharding mechanism to Typesense?
Jason
07:14 AM
There are no technical limits in Typesense on the number of collections. That said, each collection spins up 4 threads to parallelize searches, so the upper limit really depends on how many CPU cores your cluster has
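
As a back-of-the-envelope sketch of what that means at the scale mentioned earlier (10k-20k collections), the thread count can be estimated like this. The figure of 4 threads per collection comes from the message above; the function name is made up for illustration.

```python
# Threads started per collection, as quoted in the message above.
THREADS_PER_COLLECTION = 4


def search_threads_needed(collections: int) -> int:
    """Estimate how many search threads a cluster would start."""
    return collections * THREADS_PER_COLLECTION


low = search_threads_needed(10_000)   # lower end of the planned range
high = search_threads_needed(20_000)  # upper end of the planned range
```

With the planned range this gives tens of thousands of threads, far more than the CPU cores of a typical cluster.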
Jason
07:14 AM
Any reason you want to store each customer's data in a separate index, vs using a scoped API key and storing everything in one index btw?
Jason
07:17 AM
> Another question: are you planning to add a sharding mechanism to Typesense?
We do replicate the data across multiple nodes for high availability. However, if you're talking about partitioning the data and storing a subset on different nodes, we don't have plans for that at the moment. But you can always do application-side sharding, by spinning up multiple clusters and then mapping certain user-id ranges to a particular cluster, for example.

You can scale vertically up to 3TB of RAM (AWS offers this, for example), and we haven't had asks to scale up beyond this size of a dataset yet, so we haven't prioritized horizontal scaling.
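
A minimal sketch of the application-side sharding idea described above, mapping user-id ranges to clusters. The cluster hostnames and range boundaries are hypothetical and only serve the illustration; nothing here is part of Typesense itself.

```python
from bisect import bisect_right

# Hypothetical clusters: (first user ID served, cluster host).
# Entries must stay sorted by the first user ID.
CLUSTER_RANGES = [
    (0, "cluster-a.example.com"),
    (100_000, "cluster-b.example.com"),
    (200_000, "cluster-c.example.com"),
]

_RANGE_STARTS = [start for start, _ in CLUSTER_RANGES]


def cluster_for_user(user_id: int) -> str:
    """Return the host of the cluster that holds this user's data."""
    if user_id < _RANGE_STARTS[0]:
        raise ValueError(f"user_id {user_id} is below the first range")
    # bisect_right finds the first range starting after user_id;
    # the range just before it is the one containing user_id.
    index = bisect_right(_RANGE_STARTS, user_id) - 1
    return CLUSTER_RANGES[index][1]
```

The application would call `cluster_for_user` before each request and send it to the returned host, so each cluster only ever sees its own slice of users.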
Tugay
07:18 AM
Because in our app a user can add dynamic props to a product, so the collection must be dynamic for each customer, and the dynamic props will be used for filtering and faceting.

Tugay
07:19 AM
We have to use an alias to update an existing collection, right?
Jason
07:20 AM
An alias is like a symlink; you can either use that or use the collection name directly to perform operations on the collection.
Tugay
07:20 AM
> You can scale vertically up to 3TB of RAM (AWS offers this, for example), and we haven’t had asks to scale up beyond this size of a dataset yet, so we haven’t prioritized horizontal scaling.
👍
Tugay
07:21 AM
Can we update a collection?
Jason
07:21 AM
Do you mean the schema?
Tugay
07:21 AM
Yes
Jason
07:22 AM
Not at the moment unfortunately. Ah I now see what you meant earlier. You'd have to create a new collection, and then if you use an alias, update the alias to point to the new collection
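
The alias swap described above can be sketched as a list of HTTP requests. This assumes the collection and alias endpoints as I understand them from the Typesense API docs (`POST /collections`, `PUT /aliases/<alias>`, `DELETE /collections/<name>`); the collection names, the schema, and the helper name are hypothetical, and nothing is actually sent.

```python
from typing import Any, Dict, List, Optional, Tuple

# (HTTP method, path, JSON body or None)
Request = Tuple[str, str, Optional[Dict[str, Any]]]


def plan_schema_change(
    alias: str,
    new_schema: Dict[str, Any],
    old_collection: Optional[str] = None,
) -> List[Request]:
    """List the requests needed to move an alias to a re-created collection."""
    steps: List[Request] = [
        # 1. Create a new collection that carries the changed schema.
        ("POST", "/collections", new_schema),
        # 2. Re-index the documents into the new collection (omitted here).
        # 3. Point the alias at the new collection.
        ("PUT", f"/aliases/{alias}", {"collection_name": new_schema["name"]}),
    ]
    if old_collection is not None:
        # 4. Optionally drop the old collection once nothing reads from it.
        steps.append(("DELETE", f"/collections/{old_collection}", None))
    return steps


# Hypothetical example: customer 42 adds a facetable "color" field.
schema_v2 = {
    "name": "products_42_v2",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "color", "type": "string", "facet": True},
    ],
}
steps = plan_schema_change("products_42", schema_v2, "products_42_v1")
```

Because the application always searches through the alias `products_42`, it does not need to know which versioned collection is live.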
Tugay
07:23 AM
Yep, we are planning to do it that way 👍
Jason
07:23 AM
That said, v0.20 has a new auto-schema detection feature, where the first time a field is encountered in a record, it will automatically be indexed if you turn this mode on
Jason
07:23 AM
So if you don't need to change the datatype of a field and only need to add new fields, then the auto-schema detection feature will be useful for you
Tugay
07:24 AM
Wow that would be great for us 🎉
Jason
07:24 AM
I actually have a nightly build with the feature! Would you be interested in beta testing it if I give you a docker build?
Tugay
07:25 AM
> There are no technical limits in Typesense on the number of collections. That said, each collection spins up 4 threads to parallelize searches, so the upper limit really depends on how many CPU cores your cluster has
Unfortunately this will be a big bottleneck for us 😞 We need to redesign our system for that
Tugay
07:25 AM
> I actually have a nightly build with the feature! Would you be interested in beta testing it if I give you a docker build?
I would love to 🙂
Jason
07:26 AM
Awesome! Let me put some instructions together for you, the docs are not yet written for it
Tugay
07:27 AM
One final question, sorry for taking so much of your time 🙂 Is there any limit on the number of fields?
Jason
07:28 AM
> Unfortunately this will be a big bottleneck for us 😞 We need to redesign our system for that
If you allow your users to define custom fields on the product, then going down the path of one collection per user makes total sense, because the schema is different for each user. v0.20 also has some threading improvements where we'll be able to use a shared thread pool to process requests across multiple collections. So this should allow you to scale to an even higher number of collections
Jason
07:29 AM
> One final question, sorry for taking so much of your time 🙂 Is there any limit on the number of fields?
Happy to answer! No, there are no limits on number of fields. As long as you have sufficient RAM to hold the data, Typesense will happily chug along
Jason
07:34 AM
Alright! Here are instructions to use the new auto-schema detection feature: https://gist.github.com/jasonbosco/c712b52a4b29e84ebce82c9a5ec82ffc

I'd love to get your feedback on how it works out for your use-case.
Tugay
07:45 AM
Thank you so much for your help. We will try it as soon as possible 👍

Andrew
07:57 AM
> Any reason you want to store each customer's data in a separate index, vs using a scoped API key and storing everything in one index btw?
Hi Jason. I had thought scoped API keys were always scoped to a whole collection. Just reread the documentation. This feature is WAY cooler!
Jason
07:59 AM
Haha! Scoped API keys are a powerful feature! The "scoped" part means scoped to particular records, not just the collection. But I can see how it can be easily misunderstood as scoped to a collection
Jason
08:00 AM
You can actually embed any of the search parameters inside a scoped API key, so it's not just for filters. If you need a particular search parameter to not be changeable by users, you can embed it in a scoped API key and do searches with that
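
A sketch of how such a scoped key can be derived from a parent search-only key with search parameters embedded in it. The construction below (HMAC-SHA256 of the JSON parameters, then base64 over digest + first four characters of the parent key + parameters) mirrors, to the best of my knowledge, what the official client libraries do, so treat it as an assumption and prefer the client's own helper in real code. The parent key, the filter, and the field name are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def generate_scoped_search_key(parent_search_key: str, embedded_params: dict) -> str:
    """Derive a scoped key that carries embedded_params with it."""
    params_json = json.dumps(embedded_params)
    # Sign the embedded parameters with the parent key.
    digest = base64.b64encode(
        hmac.new(
            parent_search_key.encode("utf-8"),
            params_json.encode("utf-8"),
            digestmod=hashlib.sha256,
        ).digest()
    ).decode("utf-8")
    # The prefix identifies which parent key signed the parameters.
    key_prefix = parent_search_key[:4]
    raw_scoped_key = f"{digest}{key_prefix}{params_json}"
    return base64.b64encode(raw_scoped_key.encode("utf-8")).decode("utf-8")


# Hypothetical parent key and embedded parameters.
scoped_key = generate_scoped_search_key(
    "abcd1234-parent-search-key",
    {"filter_by": "customer_id:=42", "exclude_fields": "cost_price"},
)
```

Searches made with `scoped_key` would then always carry the embedded filter and excluded field, and the end user cannot change them.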
Jason
08:01 AM
Here's another interesting use case that came up recently: https://github.com/typesense/typesense/issues/193#issuecomment-765878863

Tugay
08:09 AM
Hi Andrew, it's because every collection will have dynamic fields that are unique per customer. We could add all fields to a single collection and filter responses using include_fields and facet_by, but there may be 10k fields within a collection and I am not sure about the efficiency of this solution 😄
Kishore Nallan
09:55 AM
Tugay, regarding the per-collection threading resources that Jason mentioned earlier: this is also addressed in the 0.20 RC build. A common shared thread pool is used, so it's no longer a constraining factor for having a large number of collections. In fact, I think having a per-customer collection is an easy way to scale, as it offers a lot of flexibility and is a logical way to shard your data for performance.
Tugay
11:26 AM
Hi Kishore Nallan, yes, the shared thread pool improvement would be perfect for our solution. We've also just made a little POC with the RC build: adding new fields and filtering on them works well in our test cases, but we need to use facet: true on dynamic fields too, so it is not suitable for us yet. Also, are you considering adding search: false and index: false to the field definition? Since we enable auto-schema detection, we may want to prevent some fields from being indexed.
Kishore Nallan
11:34 AM
Implementing facet: true is easy; we held back from doing that only because facets can consume memory, so enabling it on every field (especially long text fields like description) would be a huge waste of resources. We're thinking about how best to handle that. One way of doing that is to enable facets only on field names ending with a _facet suffix.

> Also, are you considering adding search: false and index: false to the field definition?

Would you know up front which fields will not need to be searched?
Tugay
11:54 AM
Yes. For our e-commerce platform, only the product name and some additional fields will be searchable; other fields will be used for filtering and facets.
Tugay
12:01 PM
> Thinking of how best to handle that. One way of doing that is to enable facets only on field names ending with a _facet suffix
This is a good solution but not a flexible one; wildcards could be considered instead. For example, in the fields definition we could use the following syntax to dynamically match field names:

[
    {
      name: 'created_at',
      type: 'int64'
    },
    {
      name: '*_auto', 
      type: 'auto'
    },
    {
      name: '*_fct', 
      type: 'auto',
      facet: true
    },
    // stringify rest
    {
      name: '*', 
      type: 'stringify',
      facet: true
    }
] 
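
A sketch of how wildcard rules like the ones proposed above could be resolved for an incoming field name. This models the proposal only, not an implemented Typesense feature; the "first matching rule wins" ordering and the helper name are assumptions, and the `stringify` type is taken from the proposal as written.

```python
from fnmatch import fnmatchcase

# The proposed field definitions, in priority order (most specific first).
FIELD_RULES = [
    {"name": "created_at", "type": "int64"},
    {"name": "*_auto", "type": "auto"},
    {"name": "*_fct", "type": "auto", "facet": True},
    # Catch-all: stringify everything else.
    {"name": "*", "type": "stringify", "facet": True},
]


def resolve_field_rule(field_name, rules=FIELD_RULES):
    """Return the first rule whose name pattern matches field_name, or None."""
    for rule in rules:
        # fnmatchcase does case-sensitive shell-style matching ("*" = anything).
        if fnmatchcase(field_name, rule["name"]):
            return rule
    return None
```

Under these assumptions a field named `color_fct` resolves to the faceted `*_fct` rule, while an unlisted field such as `description` falls through to the catch-all rule.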
Kishore Nallan
12:06 PM
Excellent. The index: false configuration can also be specified in the same way.
Gabe
04:29 PM
> Now if you use scopedApiKey to do searches instead of the main search api key, the server will automatically enforce the embedded exclude_fields param and users can't override it.
I'm using exactly this to protect sensitive data and prevent excess data from being transmitted over the wire!
Jason
06:05 PM
Gabe, that's great! Did you stumble on the GitHub issue first or did you discover that you could do this yourself?
Gabe
06:06 PM
you told me 😁
Jason
06:13 PM
Oh lol, I’ve got some bad memory!