# community-help
a
Hi everyone! I'm using Typesense to index a large CSV of public records in an effort to enable government transparency. For this purpose, I created a Python script that loads a huge CSV file, cleans it, splits it into chunks, and imports the records in parallel through the Typesense API. The first few thousand records load correctly, but seconds later I start getting write timeouts from the Typesense library:
```
ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))
```
I tried retrying failing requests, but I can't even seem to catch the exceptions raised by the import_ function. My instance should have more than enough memory and CPU to handle everything (10 cores / 20 GB RAM for an 8 GB dataset). Any ideas?
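(For context, the failing setup is roughly along these lines. This is only a sketch; the file name, worker count, and client settings are illustrative, not the actual script.)
```python
import typesense
import dask.dataframe as dd
from concurrent.futures import ThreadPoolExecutor

client = typesense.Client({
    'api_key': 'xyz',  # placeholder
    'nodes': [{'host': 'localhost', 'port': '8108', 'protocol': 'http'}],
    'connection_timeout_seconds': 10,
})

def import_partition(delayed_partition):
    # Materialize one Dask partition and push its rows to Typesense.
    docs = delayed_partition.compute().to_dict(orient='records')
    return client.collections['revenue_entry'].documents.import_(docs, {'action': 'create'})

ddf = dd.read_csv('records.csv')  # read out-of-core, one partition at a time
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(import_partition, ddf.to_delayed()))
```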
k
1. What does the Typesense log say? 2. Are you using the import API? If so, you don't have to parallelize the writes: the API itself has a batching parameter that lets you send in large amounts of data.
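(Assuming the batching parameter meant here is the import endpoint's batch_size option, a single call would look roughly like this; the values are illustrative:)
```python
with open('records.jsonl') as f:
    client.collections['revenue_entry'].documents.import_(
        f.read(),  # stringified JSONL; note this still reads the whole file into client memory
        {'action': 'create', 'batch_size': 1000},
    )
```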
a
1. (Edit: I'm uploading the full logs.) 2. Because the dataset is bigger than local memory (> 8 GB), I can't load it all at once, so I use Dask DataFrames to load it out-of-core and operate on chunks.
k
Can you generate a single JSONL file for your dataset (convert your CSV into a single JSONL file) and then try importing that using a single API call?
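(One way to do the CSV-to-JSONL conversion without holding the whole file in memory; the file names are placeholders and any cleaning steps are omitted:)
```python
import csv
import json

# Stream the CSV row by row and write each row out as one JSON line.
with open('records.csv', newline='') as src, open('records.jsonl', 'w') as dst:
    for row in csv.DictReader(src):
        dst.write(json.dumps(row) + '\n')
```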
a
They don't show any errors.
k
Yeah the logs look fine. What's the free memory now on the machine?
a
And yes, I can probably add a middle step: convert first to a giant JSONL file and then try the import.
k
I do see a "response_abort called" in the logs though.
That is logged when an import is aborted midway because the client disconnects.
I think you will have great success with a single JSONL file. Our import API is completely streaming and we regularly import several gigabytes of JSONL files into Typesense without any issues.
🙌 1
a
I won't need to load the JSONL file to memory, right?
k
The write timeouts could be an indication that the server is getting more writes than it can finish indexing within the default timeout interval. With a streaming import, we automatically handle this scenario.
Re: "I won't need to load the JSONL file to memory, right?" That will depend on the HTTP client used. It should be smart enough. Otherwise, just use curl.
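(A sketch of streaming the upload from Python with the requests library instead of the official client; the host, API key, and collection name are placeholders, and the endpoint path matches the curl command shown later in this thread. Passing a file object as the body lets requests stream it from disk rather than reading it fully into memory.)
```python
import requests

with open('records.jsonl', 'rb') as f:
    resp = requests.post(
        'http://localhost:8108/collections/revenue_entry/documents/import?action=create',
        headers={'X-TYPESENSE-API-KEY': 'xyz'},
        data=f,          # streamed from disk
        timeout=3600,    # generous timeout for a long-running import
    )

# The response contains one JSON result line per document.
print(resp.text.splitlines()[-1])
```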
a
Looking at the API reference, it seems the Typesense library for Python uses batches while uploading JSONL, is that right?
k
The Python client does not accept a file name, so it must be given either a list of document objects or a stringified JSONL representation of the documents, neither of which is streaming in nature.
So I suggest using curl.
a
I understand. Thank you!
k
I have taken note to improve this aspect of the client.
a
My experience with Typesense has been awesome so far. We're migrating from AppSearch and we're just incredibly impressed by its performance.
Thank you for making the project open source!
k
Glad to hear, thank you for the feedback. Definitely want to keep making it better.
a
We would use Typesense Cloud, but we're an NGO with a very limited budget, so we're taking advantage of AWS credits at the moment. We will definitely consider migrating in the future.
k
Understood 👍
r
Just as an aside, I ran into something similar yesterday and also had to chunk my upsert of 1000 documents into chunks of roughly 200 each for it to work.
I am using a weak server, but my dataset is small.
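(The workaround described above is roughly this sketch; the chunk size is arbitrary and the variable names follow the snippet below:)
```python
def chunks(docs, size=200):
    # Yield successive fixed-size slices of the document list.
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

for batch in chunks(documents, size=200):
    typesense_client.collections['collection'].documents.import_(batch, {'action': 'upsert'})
```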
k
Did you make a single import API call? Or did you run imports in parallel via code?
r
```
typesense_client.collections['collection'].documents.import_(documents, {'action': 'upsert'})
```
that's what I do
k
And the client times out?
r
Yeah, and at times it also gave me an error parsing the response from the server. TBH I just chunked it and didn't look into it further, since that fixed it.
k
Client timeout can occur depending on the size and number of documents, in which case you can bump up the client timeout value. It should then complete gracefully. It's a client config change.
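(In the Python client, that config change is, as far as I know, the connection_timeout_seconds setting; the value below is illustrative:)
```python
import typesense

client = typesense.Client({
    'api_key': 'xyz',
    'nodes': [{'host': 'localhost', 'port': '8108', 'protocol': 'http'}],
    'connection_timeout_seconds': 300,  # allow long-running imports to finish
})
```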
r
I did, but it seemed to be timing out before my configured timeout, which was 10s.
anyway I can revisit it
gonna have to make changes to my importer again
k
If the request is terminated abruptly, that can also cause a parse error.
If you can reproduce it consistently on a dataset you can share then I will be happy to troubleshoot.
r
yeah I can share this dataset, I will take a look at it next time I am working on it and provide you with the dataset and the code that imports it.
👍 1
s
I too had to increase my timeout to 10s for a 1000-doc import... sometimes it would just drop a few files... anyway, increasing the timeout fixed it.
a
I generated JSONL files and tried curl, but now I get "curl: (18) transfer closed with outstanding read data remaining" every time I upload a file.
@Kishore Nallan
k
Can you please tell me: a) How many lines does the generated JSONL file contain? b) What exact curl command are you using? c) How long does the curl command run before getting this error? d) After the curl command fails, how many records were imported successfully into Typesense?
a
a) 229061 (each JSONL file)
b)
```
curl -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" -X POST --data-binary @$FILE \
      "http://$ENDPOINT/collections/revenue_entry/documents/import?action=create"
```
c) 1 min 15 s
d) Around 70k
k
Got it. Also, what is the size of the file with 229061 lines?
a
170MB
k
Hmm 🤔 That is perfectly fine and it should not time out this way... We regularly import files with millions of records for our demos. Are you importing these files from your local machine to the remote machine, or are you running the imports on the remote machine itself? Can you try the latter to see if it helps?
Basically, try copying a file onto the remote machine and then running the import command locally on that machine to see if it helps.
a
Local to remote, less than 80 ms latency to my container on AWS.
k
Okay, let's rule out any networking quirks between local to remote first.
a
There's an AWS Application Load Balancer in front of the container. I'll check its configuration.
k
If it repeatedly fails at exactly the 1 min 15 s mark, there might be some form of timeout configuration somewhere.
a
Can't find anything on the AWS ALB config.
k
I think if you just copied the file inside the instance and ran the import, we can rule this out easily. If the issue occurs again, then we know it is not network related.
a
@Kishore Nallan I just uploaded the file from an instance in the same internal network as the container and I got the same result.
Wait, I think using a different subnet worked.
k
Is Typesense running inside Docker on a plain EC2 instance?
a
It's running in Docker on Fargate.
For now my workaround will be to upload the file via SFTP to a secondary instance and import it over the internal network. Still not sure what's causing the limit, but it's definitely a network component in AWS.
k
OK, glad to hear that it works with the workaround. The Application Load Balancer might have a default idle timeout that needs to be increased.
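(If the ALB idle timeout does turn out to be the cause, it can be raised with something like the AWS CLI call below; the ARN and value are placeholders, and this is only a guess at the root cause:)
```
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <your-alb-arn> \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
```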