does CPU not get fully utilized during jsonl import? (and also during boot up)
# community-help
j
does CPU not get fully utilized during jsonl import? (and also during boot up) in my tests, the cpu never goes above 1-3 cores, and total throughput is roughly 30k docs/sec (all http POST requests are sent immediately, in parallel, but each waits for its response to return). during boot it's the same story: slow load, with plenty of IO capacity (and CPU) to spare
• 16 cores (32 hyperthreads), 128GB of RAM
• indexing a collection with only a single string field (i'm trying to narrow down the root cause)
• max file descriptors raised
• no cpu/ram/disk/io limits, no other activity on the machine
• using jsonl import (tried the default batch size of 40 and up to 10,000, no difference)
• adjusted the chunk size sent via jsonl import from 5,000 docs to 100,000 docs (no difference)
env vars:
TYPESENSE_HEALTHY_READ_LAG=10000
TYPESENSE_HEALTHY_WRITE_LAG=10000
TYPESENSE_NUM_DOCUMENTS_PARALLEL_LOAD=1000000
TYPESENSE_THREAD_POOL_SIZE=10000
have tested versions 0.25 and 26.0 so far, and using long timeouts
(and fast, local SSD, striped across 4 drives)
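For context, this is roughly the shape of the parallel import being described — a minimal sketch using Python's requests against the standard /collections/{name}/documents/import endpoint. The host, API key, collection name, worker count, and chunking are placeholders, not the exact script used in these tests.
```python
import requests
from concurrent.futures import ThreadPoolExecutor

TYPESENSE_URL = "http://localhost:8108"   # assumed local node
API_KEY = "xyz"                           # placeholder API key
COLLECTION = "docs"                       # placeholder collection name

def import_chunk(jsonl_chunk: str) -> str:
    # POST one JSONL chunk; batch_size controls server-side batching
    resp = requests.post(
        f"{TYPESENSE_URL}/collections/{COLLECTION}/documents/import",
        params={"action": "create", "batch_size": 40},
        headers={"X-TYPESENSE-API-KEY": API_KEY},
        data=jsonl_chunk.encode("utf-8"),
        timeout=600,  # long client timeout, as mentioned above
    )
    resp.raise_for_status()
    return resp.text  # one JSON result line per document

def chunks(path: str, docs_per_chunk: int = 5000):
    # Split a large JSONL file into chunks of N documents each
    buf = []
    with open(path) as f:
        for line in f:
            buf.append(line)
            if len(buf) == docs_per_chunk:
                yield "".join(buf)
                buf = []
    if buf:
        yield "".join(buf)

# Fire chunks in parallel; each request blocks until the server responds
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(import_chunk, chunks("data.jsonl")))
```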
ó
I face the same issue. I came to the conclusion that it is optimized to load multiple collections at the same time, not the same collection in parallel. Maybe the bottleneck is actually updating the data structures in memory.
j
i saw your earlier post oscar, and i had a similar suspicion. surprised no one else (seemingly) has suggested this (in github issues or here in slack)
ó
Probably my dataset is bigger so it's more obvious. I'm at 34 GB of RAM on a single collection xD
j
34GB is not so big though, i'd like to do 500GB or more
but at 30k docs/sec jsonl ingestion (and bootup), it's not practical
ó
In RAM? You'd better have a cluster and expect a downtime of 6-8h for restarting. Not practical at all
j
agreed
ó
Are you sure you need all those fields indexed?
My dataset is about 300 GB, but the indexed fields are only 34 GB
j
my next approach is to cut down on what's being indexed, but before i go down that route i want to verify whether this lack of cpu/resource utilization is expected
ó
It's directly proportional to the fields with index: true, and the other field options like sort/infix make it worse. Stemming with your locale improves it, probably because it indexes fewer characters.
Just make the fields you don't need to search on optional and non-indexed, it will be way better
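As a hedged illustration of that suggestion (field names and host/key are made up): in the collection schema, only the fields you actually search on stay indexed, while everything else is marked optional and non-indexed so it is stored but no in-memory index is built for it.
```python
import requests

schema = {
    "name": "docs",
    "fields": [
        # the one field we actually search on
        {"name": "title", "type": "string", "locale": "en"},
        # stored but not indexed: no in-memory index is built for these
        {"name": "body", "type": "string", "index": False, "optional": True},
        {"name": "raw_payload", "type": "string", "index": False, "optional": True},
    ],
}

requests.post(
    "http://localhost:8108/collections",      # assumed local node
    headers={"X-TYPESENSE-API-KEY": "xyz"},    # placeholder API key
    json=schema,
)
```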
j
well i don't care so much about how much ram it uses... i care that it doesn't fully utilize the CPU to load data
ó
Agree, another option is to split into multiple collections and use the multi_search endpoint
Like split by time, or a sequential id, or whatever
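A sketch of that idea (collection names, query fields, host, and key are placeholders): partition the data by time into separate collections, then fan one query out over all partitions in a single multi_search call.
```python
import requests

# One search per partition, all sent in a single HTTP round trip
body = {
    "searches": [
        {"collection": "docs_2023", "q": "hello", "query_by": "title"},
        {"collection": "docs_2024", "q": "hello", "query_by": "title"},
    ]
}

resp = requests.post(
    "http://localhost:8108/multi_search",      # assumed local node
    headers={"X-TYPESENSE-API-KEY": "xyz"},    # placeholder API key
    json=body,
)
# resp.json()["results"] holds one result set per partition;
# merging/ranking across partitions is up to the client
```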
j
yea multi collections is interesting... almost like a poor man's sharding
ó
It's more like partitioning rather than sharding. You can partition your data within the same server (like with SQL DBs).
j
@Jonathan Otto Responding to your question in this thread, since I want to reference something from here.
is typesense CPU bound?
Yeah, within some parameters (more details below).
can more than 1k-5k docs per api call happen if there are N cores?
Yeah, should be possible.
my experience thus far is indexing (on a single collection) seems to be bottlenecked at 1 core only
Every field has a standalone index, and updating this index requires a write lock. So Typesense only parallelizes indexing across fields within a document, and across different collections. If your documents only have a single field, indexing is effectively a serial operation, and the only way to speed that particular scenario up is a processor with a higher clock speed. But usually documents contain multiple indexed fields (often more fields than available cores), so in practice all available CPU cores are utilized well.
🎉 1
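To make that model concrete, here is a rough toy sketch (not Typesense's actual code): one lock per field index, so a document with many indexed fields fans out across workers, while a single-field collection collapses to serial updates on one lock.
```python
import threading
from concurrent.futures import ThreadPoolExecutor

class CollectionIndex:
    """Toy model: one standalone index (and one write lock) per field."""

    def __init__(self, field_names):
        self.indices = {name: {} for name in field_names}
        self.locks = {name: threading.Lock() for name in field_names}
        self.pool = ThreadPoolExecutor(max_workers=32)

    def _index_field(self, field, doc_id, value):
        # Updating a field's index requires its write lock,
        # so writes to the *same* field are serialized.
        with self.locks[field]:
            self.indices[field].setdefault(value, set()).add(doc_id)

    def index_document(self, doc_id, doc):
        # Fields of one document are indexed in parallel;
        # with a single field this degenerates to one worker.
        futures = [
            self.pool.submit(self._index_field, field, value and doc_id, value)
            if False else self.pool.submit(self._index_field, field, doc_id, value)
            for field, value in doc.items()
        ]
        for f in futures:
            f.result()

idx = CollectionIndex(["title", "tags", "brand"])
idx.index_document(1, {"title": "red shoe", "tags": "footwear", "brand": "acme"})
```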
j
thank you jason, this is very helpful. i understand the design/arch now. and, thank you for a great product and contribution to the software/infra world
🙏 1
j
Thanks Jonathan!