#community-help

Discussing Document Indexing Speeds and Typesense Features

TLDR Thomas asks about the speed of indexing and associated factors. The conversation reveals that larger batch sizes improve speed, that NVMe disks can already be used because Typesense only sees a file system, and that index size is limited by RAM. Jason shares plans on supporting nested fields, and they explore a solution for products in multiple categories and catalogs.

Feb 23, 2022 (23 months ago)
Thomas
04:11 PM
How can I see the speed of indexing?
Harrison
04:12 PM
Normally, once the request has completed, the indexing is complete
Thomas
04:13 PM
Ok, so indexing 3000 product documents (official sample) takes 1800ms on 3 cores?
Harrison
04:14 PM
are you inserting them one at a time or via Line separated JSON?
Thomas
04:14 PM
JSONL
Thomas
04:14 PM
2.6MB
Thomas
04:14 PM
from 1.8 sec to 2.5 sec.
Thomas
04:14 PM
batch is 100
Jason
04:15 PM
1.8s, sounds about right... But most of it is overhead with the request processing.

As another data point, I've indexed 2.2M docs in 3.6 minutes on a 4vCPU server
Thomas
04:15 PM
With what settings?
Thomas
04:15 PM
It's taking longer and longer to index too, which is weird
Jason
04:15 PM
I sent the entire 2.2M docs as JSONL in a single import API call, with the default (server-side) batch size
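A single-call bulk import like the one Jason describes could be sketched roughly as follows. The endpoint path and the `X-TYPESENSE-API-KEY` header follow the Typesense HTTP API; the host, API key, collection name, and sample documents are placeholders chosen for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def to_jsonl(docs):
    """Serialize documents as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(doc) for doc in docs)


def build_import_request(host, api_key, collection, docs):
    """Build (but do not send) one bulk import request carrying all docs."""
    url = f"{host}/collections/{collection}/documents/import?action=create"
    return urllib.request.Request(
        url,
        data=to_jsonl(docs).encode("utf-8"),
        headers={"X-TYPESENSE-API-KEY": api_key, "Content-Type": "text/plain"},
        method="POST",
    )


# Placeholder values for illustration only.
docs = [{"id": "1", "name": "Shoe"}, {"id": "2", "name": "Hat"}]
req = build_import_request("http://localhost:8108", "xyz", "products", docs)
# urllib.request.urlopen(req) would send it to a running server.
```

Sending every document in one request body means the fixed per-request cost is paid once rather than once per client-side batch.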
Thomas
04:16 PM
You can use batches that large?
Kishore Nallan
04:16 PM
Is this on your local machine or uploading to a remote server? Network latency also comes into the picture.
Harrison
04:16 PM
Generally you should try and do them as big as possible
Thomas
04:17 PM
local
Harrison
04:17 PM
I'd probably argue that anything under 100k docs should be done in one go
Thomas
04:17 PM
we get the same time no matter which batch size
Jason
04:17 PM
With 3K docs you mean?
Thomas
04:17 PM
Yes
Jason
04:18 PM
Yeah that's the fixed overhead in request processing
Harrison
04:18 PM
you'll only run into the network latency when doing lots of small round trips on a big index
Jason
04:18 PM
That overhead doesn't scale linearly, if you're extrapolating
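The two timings quoted in the thread make the point concrete. The sketch below only does arithmetic on those figures; they come from different machines and datasets, so the comparison is indicative rather than a benchmark.

```python
# Figures quoted in the thread (different machines and datasets).
small_docs, small_seconds = 3_000, 1.8
large_docs, large_seconds = 2_200_000, 3.6 * 60

small_rate = small_docs / small_seconds  # docs per second for the 3K import
large_rate = large_docs / large_seconds  # docs per second for the 2.2M import

# Naive linear extrapolation of the small import to 2.2M documents.
naive_seconds = large_docs / small_rate

print(f"3K import:   {small_rate:,.0f} docs/s")
print(f"2.2M import: {large_rate:,.0f} docs/s")
print(f"Naive estimate: {naive_seconds / 60:.0f} min, reported: {large_seconds / 60:.1f} min")
```

Extrapolating the 3K timing linearly would predict roughly 22 minutes for 2.2M documents, against the 3.6 minutes Jason reports, which is consistent with most of the 1.8 seconds being fixed request overhead.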
Thomas
04:18 PM
Ok, so larger batch sizes are always better
Jason
04:18 PM
For sure
Harrison
04:18 PM
i.e. 100 docs at a time on a 1 million doc index
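Harrison's example can be put in numbers by counting round trips. This is a minimal sketch; `per_request_overhead` is a made-up figure for illustration, not a measurement from the thread.

```python
import math


def round_trips(total_docs, batch_size):
    """Number of import requests needed to send total_docs in client-side batches."""
    return math.ceil(total_docs / batch_size)


def overhead_seconds(total_docs, batch_size, per_request_overhead):
    """Total fixed overhead, assuming every request costs the same fixed amount."""
    return round_trips(total_docs, batch_size) * per_request_overhead


# 0.05 s per request is a hypothetical value used only to show the trend.
for batch_size in (100, 10_000, 1_000_000):
    requests_needed = round_trips(1_000_000, batch_size)
    print(batch_size, requests_needed, overhead_seconds(1_000_000, batch_size, 0.05))
```

With batches of 100, a 1 million document index needs 10,000 requests, so any fixed cost per request is paid 10,000 times.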
Thomas
04:19 PM
Ok, noted
Thomas
04:19 PM
are there plans to support NVMe as storage?
Jason
04:19 PM
You can already use NVMe disks
Harrison
04:19 PM
:thinking_face: The type of storage shouldn't affect typesense
Harrison
04:19 PM
or any program for the most part
Jason
04:20 PM
From Typesense's perspective it's just a file system
Thomas
04:20 PM
so if it doesn't fit in RAM, it fetches from disk?
Harrison
04:20 PM
the index is always stored in RAM iirc
Thomas
04:20 PM
so index size isn't limited by RAM then?
Jason
04:20 PM
Unless you mean storing indices on disk instead of RAM - we have no plans for that
Thomas
04:21 PM
So index is limited by RAM size?
Jason
04:21 PM
Index size is indeed constrained by RAM. So you need to have sufficient RAM to hold the entire index in memory.
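A back-of-the-envelope capacity check could look like the sketch below. The 3x multiplier is an assumed rule of thumb, not a figure stated in this thread, and the sizes are hypothetical.

```python
def estimated_ram_bytes(indexed_bytes, multiplier=3.0):
    """Rough RAM estimate: size of the indexed fields times an assumed multiplier."""
    return indexed_bytes * multiplier


def fits_in_ram(indexed_bytes, available_ram_bytes, multiplier=3.0):
    """True if the estimated in-memory index fits in the available RAM."""
    return estimated_ram_bytes(indexed_bytes, multiplier) <= available_ram_bytes


GIB = 1024 ** 3
# Hypothetical sizes: 2 GiB and 4 GiB of indexed fields on an 8 GiB machine.
print(fits_in_ram(2 * GIB, 8 * GIB))
print(fits_in_ram(4 * GIB, 8 * GIB))
```

Only the fields being indexed count toward `indexed_bytes`, so trimming the schema is one way to lower the estimate.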
Thomas
04:21 PM
Aha, understood
Thomas
04:21 PM
any plans to support nested fields?
Harrison
04:22 PM
Just as an FYI you may find https://cloud.typesense.org/pricing/calculator useful for working out roughly how much memory you want
Thomas
04:23 PM
I think I've seen it before
Jason
04:23 PM
Largest commercially available RAM today is 24TB, and RAM cost has only been getting cheaper. So we're hoping that for most site/app search use-cases RAM-based search that lets you build search-as-you-type instant-search experiences would work out well.
Jason
04:24 PM
> any plans to support nested fields?
Yes for sure, probably in the next few releases. Until then, here's a workaround: https://typesense.org/docs/0.22.2/api/collections.html#indexing-nested-fields
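The workaround amounts to flattening nested objects into top-level keys before indexing. A minimal sketch of that idea follows; the dot separator and the sample product are assumptions made for illustration, not taken from the linked page.

```python
def flatten(doc, parent_key="", sep="."):
    """Flatten nested dicts into a single level, joining keys with sep."""
    flat = {}
    for key, value in doc.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical product document with one nested object.
product = {"id": "1", "name": "Shoe", "brand": {"name": "Acme", "country": "DK"}}
flat_product = flatten(product)
print(flat_product)
```

Each flattened key, such as `brand.name`, would then be declared as its own field in the collection schema.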
Thomas
04:27 PM
any difference in using?
Thomas
04:27 PM
is it expected within the next two months?
Jason
04:28 PM
I'd say probably 3-4 months time frame
Thomas
04:31 PM
We have catalogs with categories that have products, and we want them filterable. How do you suggest doing this with the current version?
Jason
04:31 PM
Jason
04:32 PM
That's what powers this demo: https://ecommerce-store.typesense.org/
Thomas
04:32 PM
We did, but that's only categories, not multiple catalogs
Jason
04:32 PM
Didn't get you... Could you expand on what you mean by multiple catalogs with an example?
Thomas
04:33 PM
1 product can be in multiple catalogs and categories
Jason
04:34 PM
You could have a catalog_ids: [1,4,6] field in each product
Jason
04:34 PM
and then depending on which catalog you're rendering, filter by the catalog id?
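Jason's suggestion could be sketched as a schema plus a per-catalog filter. Only the `catalog_ids` field comes from the thread; the collection name, the other fields, and the helper functions are hypothetical, and `in_catalog` is just a local stand-in for what the server-side filter would select.

```python
# Hypothetical schema: only `catalog_ids` is taken from the thread.
products_schema = {
    "name": "products",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "catalog_ids", "type": "int32[]", "facet": True},
        {"name": "category_ids", "type": "int32[]", "facet": True},
    ],
}


def catalog_search_params(query, catalog_id):
    """Search parameters restricted to the catalog currently being rendered."""
    return {"q": query, "query_by": "name", "filter_by": f"catalog_ids:={catalog_id}"}


def in_catalog(product, catalog_id):
    """Local stand-in for the filter: is the product listed in this catalog?"""
    return catalog_id in product["catalog_ids"]


product = {"name": "Shoe", "catalog_ids": [1, 4, 6], "category_ids": [10]}
params = catalog_search_params("shoe", 4)
print(params["filter_by"], in_catalog(product, 4))
```

Because a product carries every catalog it belongs to, the same document serves all catalogs and no duplication is needed.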
Thomas
04:38 PM
catalogs are multi language and need labels
Thomas
04:38 PM
our idea was two collections, one for catalogs and categories and one for products
Jason
04:40 PM
I'd need a little more context on how you plan to query the dataset and the search UI (if you have mockups), since that will dictate how you structure the collections
Thomas
04:42 PM
I can make some sample json for you tomorrow, clocking off for today
Thomas
04:43 PM
Thanks for the replies so far
Jason
04:43 PM
Sounds good
Thomas
05:11 PM
is the full document stored in index/ram or can the document be fetched from disk?
Jason
05:24 PM
Only indexed fields are stored in RAM. The full doc is stored on disk as a backup, and also to hold unindexed fields not mentioned in the schema.
