#community-help

Handling Large JSON File for Typesense

TL;DR: Matt struggled with processing a large JSON file for Typesense. Kishore Nallan explained how to create a schema, convert the file to JSONL, and import it. They also identified the necessary keys from the JSON.


Mar 22, 2023 (9 months ago)
Matt
11:28 AM
Hey everyone, I have an interesting issue. I’ve got a massive JSON file with a load of embedded texts generated via OpenAI, however I’m not sure how to get the JSON into Typesense. Whenever I try to determine the structure of the JSON my laptop freezes (the JSON file is 170 MB). Any ideas of how to get round this issue?
Kishore Nallan
11:29 AM
👋 can you try doing "less filename.json" from the shell?
Matt
11:36 AM
hmm, sorry, no-coder alert: is this done via the terminal, or within the Typesense platform?
Kishore Nallan
11:37 AM
From your terminal
Kishore Nallan
11:38 AM
Go to the directory where the file is stored and then run the less command. You can use 'cd dirname' to switch directories on the terminal.
Matt
11:40 AM
nice, so now I’m here
[screenshot of Matt’s terminal]
Matt
11:41 AM
Do I need to determine the schema of this JSON before I can load it into Typesense?
Matt
11:41 AM
if so, how?
Kishore Nallan
11:41 AM
Is your goal to index just embeddings (i.e. float values) or text as well?
Matt
11:43 AM
hmm, I think so

I’m trying to build a QA bot on 200+ Substack articles. This JSON file is the result of a process that chunked the HTML articles into paragraphs and then passed them to the embeddings API
Matt
11:44 AM
I’d need the text also, so that I can pass it to the ChatGPT API (I think)
Kishore Nallan
11:48 AM
Okay, I will explain the generic bits and you can choose what works best for your use case:

1. You can scroll through the large JSON document via less and identify the field names that need to be indexed.
2. Then in Typesense you will create a collection. Each collection has a schema which lists the fields to be indexed and their types (what you identified in the previous step).
3. Once you create a collection, you have to import the file into Typesense. For this you have to convert the JSON file to a JSONL file. The JSONL (JSON Lines) format contains one JSON document per line, instead of the single top-level array that plain JSON uses (which makes large files like this hard to ingest). To convert JSON to JSONL you can use the jq tool as described here: https://typesense.org/docs/0.24.0/api/documents.html#import-a-json-file
4. You are now ready to import the file and then search on it. You can refer to the docs on how to do this.
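The jq conversion in the linked docs is a one-liner along the lines of `jq -c '.[]' data.json > data.jsonl`. If jq isn’t installed, the same conversion can be sketched in Python; the filenames and the sample documents below are placeholders, not the real file’s contents:

```python
import json

# Stand-in for the real file's contents, just to make the sketch runnable.
sample_docs = [
    {"doc_id": "a1", "text": "first paragraph", "embedding": [0.1, 0.2]},
    {"doc_id": "a2", "text": "second paragraph", "embedding": [0.3, 0.4]},
]
with open("data.json", "w") as f:
    json.dump(sample_docs, f)

# The actual conversion: read the top-level JSON array...
with open("data.json") as f:
    docs = json.load(f)

# ...then write one compact JSON document per line (JSONL).
with open("data.jsonl", "w") as out:
    for doc in docs:
        out.write(json.dumps(doc) + "\n")
```

The resulting data.jsonl is what gets sent to the import endpoint described in the docs.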
Matt
11:51 AM
Awesome. I’ll have a bash and see how far I can get
Matt
11:51 AM
thanks for the help
Matt
11:51 AM
😂 I’ll be back
Matt
12:09 PM
The JSON has different sections, so which do you suggest I take?

My simple assumption is I’d need the ‘text’ and then all the vectors associated with that text, but I’m not sure which keys:

["child_indices", "doc_hash", "doc_id", "embedding", "extra_info", "image", "index", "node_info", "ref_doc_id", "text"]
Kishore Nallan
12:20 PM
The embedding field likely holds the vector values
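Given that, a minimal collection schema for this use case would index just those two fields. This is a sketch: the collection name `articles` is made up, and `num_dim: 1536` assumes OpenAI’s text-embedding-ada-002 vectors, so check the length of one `embedding` array in the file. (Vector fields of type `float[]` with `num_dim` require Typesense 0.24 or later.)

```json
{
  "name": "articles",
  "fields": [
    {"name": "text", "type": "string"},
    {"name": "embedding", "type": "float[]", "num_dim": 1536}
  ]
}
```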
Mar 24, 2023 (9 months ago)
Gio
06:29 PM
Matt did you get this sorted? 🤔
Mar 26, 2023 (8 months ago)
Matt
06:57 PM
Nope, but it’s my knowledge that’s holding me back.

https://gpt-index.readthedocs.io/en/latest/reference/indices/vector_store.html#gpt_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex

This is the class that was used:

The GPTSimpleVectorIndex is a data structure where nodes are keyed by embeddings, and those embeddings are stored within a simple dictionary. During index construction, the document texts are chunked up and converted to nodes with text; they are then encoded in document embeddings stored within the dict.

So the JSON file is split into two parts. For Typesense I’m trying to get to the simple structure of “text” plus “vector”.
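Assuming one half of the file maps document IDs to embeddings and the other half holds the text nodes (the key names below are guesses based on the key list earlier in the thread; verify them with less first), joining the halves into text-plus-vector JSONL documents could look like:

```python
import json

# Stand-ins for the two halves of the saved index file; the real key names
# ("doc_id", "text", and the id -> embedding mapping) are assumptions.
embedding_dict = {
    "a1": [0.1, 0.2],
    "a2": [0.3, 0.4],
}
nodes = [
    {"doc_id": "a1", "text": "first paragraph"},
    {"doc_id": "a2", "text": "second paragraph"},
]

with open("documents.jsonl", "w") as out:
    for node in nodes:
        vec = embedding_dict.get(node["doc_id"])
        if vec is None:
            continue  # skip nodes that have no matching embedding
        out.write(json.dumps({"text": node["text"], "embedding": vec}) + "\n")
```

Each output line then holds one text chunk plus its vector, ready for a Typesense import.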
