#community-help

Resolving JSONL File Import Issues in Python

TLDR: Jon had trouble importing a large JSONL file using Python, running into JSON decode errors and request size limits. Kishore Nallan advised using curl for imports under 10 GB and pointed to an update to the Python client that handles large imports better.

Solved
Jan 08, 2023 (9 months ago)
Jon
04:50 PM
Hi, I am trying to import a large JSONL file. I believe the format of the JSONL file is correct, but after a period of time I get the following JSON decode error:
Jan 09, 2023 (9 months ago)
Kishore Nallan
01:06 AM
Have you tried importing the file directly via curl to see if the same error happens?
Jon
01:54 AM
Is there documentation on that? I thought the docs say that using import is most efficient.
Jon
01:55 AM
Should I use the API and insert one document at a time, or use curl? And where is the documentation?
Jon
02:10 AM
Using curl I get: request entity is too large
Jon
02:10 AM
These are 27 GB JSON files
Kishore Nallan
04:20 AM
Import is the most efficient way. Are you using the official Python client? It might also be running into memory limitations with a file that large. Try splitting the file into smaller chunks.
Kishore Nallan
04:32 AM
Someone recently also contributed a way to batch import via the use of Iterable here: https://github.com/typesense/typesense-python/pull/22

This is available in version 0.15.0 of the Python client, which I've just published.
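A minimal sketch of the batching idea mentioned above, assuming the documents are read lazily from disk and sent in fixed-size groups so the whole file never sits in memory. The helper names (`read_jsonl`, `batched`, `import_in_batches`) and the `import_batch` callback are hypothetical, and the client call shown in the comment uses a made-up collection name; check the client's own documentation for the exact batch-import interface.

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed document per non-empty line, reading lazily."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group an iterable into lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def import_in_batches(
    path: str,
    import_batch: Callable[[List[dict]], object],
    batch_size: int = 1000,
) -> int:
    """Send the file to import_batch one batch at a time; return the document count."""
    total = 0
    for batch in batched(read_jsonl(path), batch_size):
        # Assumption: with the Typesense Python client this callback could be
        # lambda docs: client.collections["my_collection"].documents.import_(
        #     docs, {"action": "create"})
        # where "my_collection" is a hypothetical collection name.
        import_batch(batch)
        total += len(batch)
    return total
```

Only one batch of parsed documents is held in memory at a time, so memory use depends on `batch_size` rather than on the size of the file.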
Jan 16, 2023 (9 months ago)
Jon
05:57 PM
Ok thanks. Will look into this
Jan 17, 2023 (9 months ago)
Jon
03:19 PM
I don't see why I should have to split things. I am using the Python client. It seems like it's a bug in your code, and it makes this database non-actionable.
Kishore Nallan
03:24 PM
The new API added to the python client above allows you to just pass a file object. You can consume pretty large imports that way.
Kishore Nallan
03:25 PM
We index JSONL files that are several GBs in size.
Jon
03:25 PM
What do you mean by new API? I am using the Python client. Is there something else I need to be doing?
Jon
03:25 PM
I am using JSONL
Kishore Nallan
03:29 PM
The earlier error message you posted here seems like an issue with parsing of the JSON, either the request or the response.
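One way to rule out the request side is to check the file itself before importing. The sketch below, which is only an illustration and not part of any client library, scans a JSONL file line by line and reports the lines that fail to parse. The function name `find_invalid_lines` and its `limit` parameter are hypothetical.

```python
import json
from typing import List, Tuple


def find_invalid_lines(path: str, limit: int = 10) -> List[Tuple[int, str]]:
    """Return up to `limit` (line_number, error_message) pairs for lines
    that are not valid JSON. Blank lines are skipped."""
    problems: List[Tuple[int, str]] = []
    with open(path, "r", encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, str(exc)))
                if len(problems) >= limit:
                    break
    return problems
```

If this returns an empty list, every line parses on its own and the decode error is more likely to come from the response side or from the transfer.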
Kishore Nallan
03:32 PM
Typesense server also has a maximum POST body size of 10 GB.
Kishore Nallan
03:34 PM
The Python client needs to hold the file content in memory while calling the API. This might not work for large datasets. But curl will work fine as long as the POST data is less than 10 GB. So if your total dataset size is 28 GB, you will need to split it into 3 files.
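The arithmetic behind "3 files" is a ceiling division: 28 GB / 10 GB rounds up to 3. A rough sketch of the split itself follows, assuming the file is cut on line boundaries so that no document is broken across parts. The function names and the `part-NNNN.jsonl` naming are hypothetical, and whether the server counts a GB as 10^9 or 2^30 bytes is not stated here, so leaving some headroom below the limit is a sensible precaution.

```python
import math
import os
from typing import List


def parts_needed(total_size: float, limit: float) -> int:
    """Smallest number of parts so that each part can stay within the limit."""
    return math.ceil(total_size / limit)


def split_jsonl(path: str, out_dir: str, max_bytes: int) -> List[str]:
    """Split a JSONL file into parts of at most max_bytes each, cutting only
    at line boundaries. A single line longer than max_bytes still gets its
    own part, which will then exceed the limit."""
    os.makedirs(out_dir, exist_ok=True)
    parts: List[str] = []
    out = None
    written = 0
    try:
        # Binary mode so that len(line) is the exact byte count on disk.
        with open(path, "rb") as src:
            for line in src:
                if out is None or (written > 0 and written + len(line) > max_bytes):
                    if out is not None:
                        out.close()
                    part_path = os.path.join(out_dir, f"part-{len(parts):04d}.jsonl")
                    out = open(part_path, "wb")
                    parts.append(part_path)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

Because lines vary in length, the split can produce more parts than `parts_needed` suggests when `max_bytes` is set close to the limit. Each resulting part can then be imported separately.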