# community-help
h
what is the fastest known way to sync a large dataset (30MM docs)?
j
JSONL + curl + parallel… See the code snippet here: https://typesense.org/docs/0.24.0/api/documents.html#import-a-jsonl-file
(Under “Shell”)
```bash
# If you have a large JSONL file, you can split the file and
# parallelize the import using this one-liner:
parallel --block -5 -a documents.jsonl --tmpdir /tmp --pipepart --cat 'curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -X POST -T {} http://localhost:8108/collections/companies/documents/import?action=create'
```
h
I'm using Sidekiq locally to run 500 jobs simultaneously; my estimate is that this will take 18h to finish. That's a lot of time in case I need to re-index something in production
j
Are you using the batch import endpoint?
h
no, 1 by 1
j
Yeah, that will definitely be slow. You want to use the batch import endpoint instead, and send as many as 10K docs per batch API call (make sure you increase the Ruby client timeout to, say, 10 minutes)
You want to have N sidekiq processes doing parallel import API calls, where N is the number of CPU cores you have minus 1
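In rough terms, each of those Sidekiq jobs would make one call like this (a minimal sketch using the Typesense Ruby gem; the `companies` collection name, the 10-minute `connection_timeout_seconds`, and the placeholder batch are illustrative assumptions, not your actual setup):

```ruby
require 'typesense'

# One batch import call; each parallel Sidekiq job would run something
# like this against its own ~10K-doc slice of the 30MM documents.
client = Typesense::Client.new(
  nodes: [{ host: 'localhost', port: 8108, protocol: 'http' }],
  api_key: ENV['TYPESENSE_API_KEY'],
  connection_timeout_seconds: 600 # ~10 minutes, so large batches don't time out
)

batch = [{ 'id' => '1', 'name' => 'Acme' }] # placeholder; up to ~10K docs per call

# Hits POST /collections/companies/documents/import?action=create under the hood
results = client.collections['companies'].documents.import(batch, action: 'create')

# Each entry should report per-document success; log anything that failed
failed = results.reject { |r| r['success'] }
puts "#{failed.size} docs failed to import" unless failed.empty?
```

Batching is what buys the speedup: the per-request HTTP and job overhead gets amortized over thousands of documents instead of being paid once per document.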
h
nice! i will try it, thanks!
👍 1
j
Also, heads-up about handling 503s in your code. See “Handling HTTP 503s” here: https://typesense.org/docs/guide/syncing-data-into-typesense.html#handling-http-503s
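A simple way to handle those 503s is to retry the import with a backoff, roughly like this (a sketch that assumes the Ruby client raises `Typesense::Error::ServiceUnavailable` for 503 responses; check your client version, and tune the backoff and attempt count to taste):

```ruby
require 'typesense'

# Naive retry-with-backoff around a batch import, for when Typesense
# responds with 503 (e.g. "Not Ready or Lagging") under heavy write load.
def import_with_retry(client, batch, max_attempts: 5)
  attempts = 0
  begin
    client.collections['companies'].documents.import(batch, action: 'create')
  rescue Typesense::Error::ServiceUnavailable
    attempts += 1
    raise if attempts >= max_attempts
    sleep(2**attempts) # 2s, 4s, 8s, ... before the next attempt
    retry
  end
end
```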