# community-help
h
what is the fastest known way to sync a large dataset (30MM docs)?
j
JSONL + curl + parallel… See the code snippet here: https://typesense.org/docs/0.24.0/api/documents.html#import-a-jsonl-file
(Under “Shell”)
```bash
# If you have a large JSONL file, you can split the file and
# parallelize the import using this one-liner:
parallel --block -5 -a documents.jsonl --tmpdir /tmp --pipepart --cat 'curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -X POST -T {} http://localhost:8108/collections/companies/documents/import?action=create'
```
h
I'm using Sidekiq locally to run 500 jobs simultaneously; my estimate is that this will take 18h to finish. That's a lot of time in case I need to re-index something in production
j
Are you using the batch import endpoint?
h
no, 1 by 1
j
Yeah, that will definitely be slow. You want to use the batch import endpoint instead, and send as many as 10K docs per batch API call (make sure you increase the Ruby client timeout to, say, 10 minutes)
You want to have N sidekiq processes doing parallel import API calls, where N is the number of CPU cores you have minus 1
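In rough terms, each of those Sidekiq jobs would make one call like this (a minimal sketch using the Typesense Ruby gem; the `companies` collection name, the 10-minute `connection_timeout_seconds`, and the placeholder batch are illustrative assumptions, not your actual setup):

```ruby
require 'typesense'

# One batch import call; each parallel Sidekiq job would run something
# like this against its own ~10K-doc slice of the 30MM documents.
client = Typesense::Client.new(
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
  api_key: ENV['TYPESENSE_API_KEY'],
  connection_timeout_seconds: 600 # ~10 minutes, so large batches don't time out
)

batch = [{ 'id' => '1', 'name' => 'Acme' }] # placeholder; up to ~10K docs per call

# Hits POST /collections/companies/documents/import?action=create under the hood
results = client.collections['companies'].documents.import(batch, action: 'create')

# Each entry should report per-document success; log anything that failed
failed = results.reject { |r| r['success'] }
puts "#{failed.size} docs failed to import" unless failed.empty?
```

Batching is what buys the speedup: the per-request HTTP and job overhead gets amortized over thousands of documents instead of being paid once per document.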
h
nice! i will try it, thanks!
👍 1
j
Also, heads-up about handling 503s in your code. See “Handling HTTP 503s” here: https://typesense.org/docs/guide/syncing-data-into-typesense.html#handling-http-503s
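A simple way to handle those 503s is to retry the import with a backoff, roughly like this (a sketch that assumes the Ruby client raises `Typesense::Error::ServiceUnavailable` for 503 responses; check your client version, and tune the backoff and attempt count to taste):

```ruby
require 'typesense'

# Naive retry-with-backoff around a batch import, for when Typesense
# responds with 503 (e.g. "Not Ready or Lagging") under heavy write load.
def import_with_retry(client, batch, max_attempts: 5)
  attempts = 0
  begin
    client.collections['companies'].documents.import(batch, action: 'create')
  rescue Typesense::Error::ServiceUnavailable
    attempts += 1
    raise if attempts >= max_attempts
    sleep(2**attempts) # 2s, 4s, 8s, ... before the next attempt
    retry
  end
end
```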