does CPU not get fully utilized during jsonl import? (and also during boot up)
# community-help
j
does CPU not get fully utilized during jsonl import? (and also during boot up) in my tests, the cpu never goes above 1-3 cores, and total throughput is roughly 30k docs/sec (all http POST requests are sent immediately, in parallel, but each waits for its response to return). during boot it's the same story: slow load, with plenty of IO capacity (and CPU) to spare
• 16 cores (32 hyperthreads), 128GB of RAM
• indexing a collection with only a single string field (i'm trying to narrow down the root cause)
• max file descriptors raised
• no cpu/ram/disk/io limits, no other activity on the machine
• using jsonl import (tried the default batch size of 40 and up to 10,000, no difference)
• adjusted the chunk size sent via jsonl import from 5,000 docs to 100,000 docs (no difference)
env vars:
TYPESENSE_HEALTHY_READ_LAG=10000
TYPESENSE_HEALTHY_WRITE_LAG=10000
TYPESENSE_NUM_DOCUMENTS_PARALLEL_LOAD=1000000
TYPESENSE_THREAD_POOL_SIZE=10000
have tested versions 0.25 and 26.0 so far, and using long timeouts
(and fast, local SSD, striped across 4 drives)
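For context, this is roughly the shape of the parallel import being described — a minimal sketch using Python's requests against the standard /collections/{name}/documents/import endpoint. The host, API key, collection name, worker count, and chunking are placeholders, not the exact script used in these tests.
```python
import requests
from concurrent.futures import ThreadPoolExecutor

TYPESENSE_URL = "http://localhost:8108"   # assumed local node
API_KEY = "xyz"                           # placeholder API key
COLLECTION = "docs"                       # placeholder collection name

def import_chunk(jsonl_chunk: str) -> str:
    # POST one JSONL chunk; batch_size controls server-side batching
    resp = requests.post(
        f"{TYPESENSE_URL}/collections/{COLLECTION}/documents/import",
        params={"action": "create", "batch_size": 40},
        headers={"X-TYPESENSE-API-KEY": API_KEY},
        data=jsonl_chunk.encode("utf-8"),
        timeout=600,  # long client timeout, as mentioned above
    )
    resp.raise_for_status()
    return resp.text  # one JSON result line per document

def chunks(path: str, docs_per_chunk: int = 5000):
    # Split a large JSONL file into chunks of N documents each
    buf = []
    with open(path) as f:
        for line in f:
            buf.append(line)
            if len(buf) == docs_per_chunk:
                yield "".join(buf)
                buf = []
    if buf:
        yield "".join(buf)

# Fire chunks in parallel; each request blocks until the server responds
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(import_chunk, chunks("data.jsonl")))
```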
ó
I face the same issue. I came to the conclusion that it is optimized to load multiple collections at the same time, not the same collection in parallel. Maybe the bottleneck is actually updating the data structures in memory.
j
i saw your earlier post oscar, and i had a similar suspicion. surprised no one else (seemingly) has suggested this (in github issues or here in slack)
ó
Probably my dataset is bigger so it's more obvious. I'm at 34 GB of RAM on a single collection xD
j
34GB is not so big though, i'd like to do 500GB or more
but at 30k docs/sec jsonl ingestion (and bootup), it's not practical
ó
In RAM? You'd better have a cluster and expect a downtime of 6-8h for restarting. Not practical at all
j
agreed
ó
Are you sure you need all those fields indexed?
My dataset is about 300 GB, but the indexed fields are only 34 GB
j
my next approach is to cut down on what's being indexed, but before i go down that route i want to verify whether this lack of cpu/resource utilization is expected
ó
It's directly proportional to the fields with index: true, and the other field options like sort/infix make it worse. Stemming with your locale improves it, probably because it indexes fewer characters.
Just make the fields you don't need to search on optional and non-indexed, it will be way better
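As a hedged illustration of that suggestion (field names and host/key are made up): in the collection schema, only the fields you actually search on stay indexed, while everything else is marked optional and non-indexed so it is stored but no in-memory index is built for it.
```python
import requests

schema = {
    "name": "docs",
    "fields": [
        # the one field we actually search on
        {"name": "title", "type": "string", "locale": "en"},
        # stored but not indexed: no in-memory index is built for these
        {"name": "body", "type": "string", "index": False, "optional": True},
        {"name": "raw_payload", "type": "string", "index": False, "optional": True},
    ],
}

requests.post(
    "http://localhost:8108/collections",      # assumed local node
    headers={"X-TYPESENSE-API-KEY": "xyz"},    # placeholder API key
    json=schema,
)
```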
j
well i don't care so much about how much ram it uses... i care that it doesn't fully utilize the CPU to load data
ó
Agree, another option is to split into multiple collections and use the multi_search endpoint
Like split by time, or a sequential id, or whatever
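A sketch of that idea (collection names, query fields, host, and key are placeholders): partition the data by time into separate collections, then fan one query out over all partitions in a single multi_search call.
```python
import requests

# One search per partition, all sent in a single HTTP round trip
body = {
    "searches": [
        {"collection": "docs_2023", "q": "hello", "query_by": "title"},
        {"collection": "docs_2024", "q": "hello", "query_by": "title"},
    ]
}

resp = requests.post(
    "http://localhost:8108/multi_search",      # assumed local node
    headers={"X-TYPESENSE-API-KEY": "xyz"},    # placeholder API key
    json=body,
)
# resp.json()["results"] holds one result set per partition;
# merging/ranking across partitions is up to the client
```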
j
yea multi collections is interesting... almost like a poor man's sharding
ó
It's more like partitioning rather than sharding. You can partition your data within the same server (like with SQL DBs).
j
@Jonathan Otto Responding to your question in this thread, since I want to reference something from here.
is typesense CPU bound?
Yeah, within some parameters (more details below).
can more than 1k-5k docs per api call happen if there are N cores?
Yeah, should be possible.
my experience thus far is indexing (on a single collection) seems to be bottlenecked at 1 core only
Every field has a standalone index, and updating this index requires a write lock. So Typesense only parallelizes indexing across fields within a document, and across different collections. If your documents only have a single field, indexing is effectively a serial operation, and the only way to speed that particular scenario up is a processor with a higher clock speed. But usually documents contain multiple indexed fields (often more fields than available cores), so in practice all available CPU cores are utilized well.
🎉 1
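To make that model concrete, here is a rough toy sketch (not Typesense's actual code): one lock per field index, so a document with many indexed fields fans out across workers, while a single-field collection collapses to serial updates on one lock.
```python
import threading
from concurrent.futures import ThreadPoolExecutor

class CollectionIndex:
    """Toy model: one standalone index (and one write lock) per field."""

    def __init__(self, field_names):
        self.indices = {name: {} for name in field_names}
        self.locks = {name: threading.Lock() for name in field_names}
        self.pool = ThreadPoolExecutor(max_workers=32)

    def _index_field(self, field, doc_id, value):
        # Updating a field's index requires its write lock,
        # so writes to the *same* field are serialized.
        with self.locks[field]:
            self.indices[field].setdefault(value, set()).add(doc_id)

    def index_document(self, doc_id, doc):
        # Fields of one document are indexed in parallel;
        # with a single field this degenerates to one worker.
        futures = [
            self.pool.submit(self._index_field, field, value and doc_id, value)
            if False else self.pool.submit(self._index_field, field, doc_id, value)
            for field, value in doc.items()
        ]
        for f in futures:
            f.result()

idx = CollectionIndex(["title", "tags", "brand"])
idx.index_document(1, {"title": "red shoe", "tags": "footwear", "brand": "acme"})
```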
j
thank you jason, this is very helpful. i understand the design/arch now. and, thank you for a great product and contribution to the software/infra world
🙏 1
j
Thanks Jonathan!