Handling Large JSON File for Typesense
TLDR Matt struggled with processing a large JSON file for Typesense. Kishore Nallan explained how to create a schema, convert to JSONL, and import the file. They also identified the necessary keys from the JSON.
Mar 22, 2023 (9 months ago)
Matt
11:43 AM I'm trying to build a QA bot on 200+ Substack articles. This JSON file is the result of a process that chunked the HTML articles into paragraphs and then passed them to the embeddings API.
Kishore Nallan
11:48 AM
1. You can scroll through the large JSON document with less and identify the field names that need to be indexed.
2. Then in Typesense you will create a collection. Each collection has a schema that lists the fields to be indexed and their types (the ones you identified in the previous step).
3. Once you have created a collection, you have to import the file into Typesense. For this you have to convert the JSON file to a JSONL file. The JSONL (JSON Lines) format contains one JSON document per line instead of the array structure that JSON uses (which makes it hard to ingest large files like this). To convert JSON to JSONL you can use the jq tool as described here: https://typesense.org/docs/0.24.0/api/documents.html#import-a-json-file
4. You are now ready to import the file and then search on it. You can refer to the docs on how to do this.
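For anyone without jq at hand, the JSON-to-JSONL conversion in step 3 can be sketched in plain Python. This is a minimal sketch, assuming the file is a single top-level JSON array; the function name and file paths are made up for illustration. Note that json.load reads the whole file into memory, so a very large file may still need jq or a streaming parser.

```python
import json


def json_array_to_jsonl(src_path, dst_path):
    """Write each element of a top-level JSON array as one line of JSONL.

    Returns the number of documents written.
    """
    with open(src_path, encoding="utf-8") as src:
        documents = json.load(src)  # loads the entire file into memory

    if not isinstance(documents, list):
        raise ValueError("expected a top-level JSON array")

    with open(dst_path, "w", encoding="utf-8") as dst:
        for doc in documents:
            # json.dumps emits no raw newlines, so each document stays on one line
            dst.write(json.dumps(doc, ensure_ascii=False) + "\n")

    return len(documents)
```

Calling json_array_to_jsonl("documents.json", "documents.jsonl") would produce the one-document-per-line file that the import step expects.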
Matt
12:09 PM My simple assumption is that I'd need the "text" and then all the vectors associated with that text, but I'm not sure which keys those are:
[
  "child_indices",
  "doc_hash",
  "doc_id",
  "embedding",
  "extra_info",
  "image",
  "index",
  "node_info",
  "ref_doc_id",
  "text"
]
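If "text" and "embedding" turn out to be the two keys worth indexing, a collection schema for them might look like the sketch below. It assumes each record carries its text as a string and its embedding as a list of floats in the same object, which the thread does not confirm. The collection name is made up, and the vector dimension is read from a sample record rather than hard-coded, since the embedding model is not named here.

```python
def build_schema(collection_name, sample_doc):
    """Build a Typesense-style collection schema from one sample record.

    Assumes the record has a "text" string and an "embedding" list of floats.
    """
    embedding = sample_doc.get("embedding")
    if not isinstance(embedding, list) or not embedding:
        raise ValueError("sample record has no embedding list to size the vector field")

    return {
        "name": collection_name,
        "fields": [
            {"name": "text", "type": "string"},
            # Vector fields are float arrays with a fixed number of dimensions
            {"name": "embedding", "type": "float[]", "num_dim": len(embedding)},
        ],
    }


# Hypothetical sample record; real embeddings are far longer than 3 values
sample = {"text": "First paragraph of an article.", "embedding": [0.12, -0.03, 0.88]}
schema = build_schema("substack_paragraphs", sample)
```

The resulting dict has the name-plus-fields shape that collection creation takes. The other keys in the list can simply be left out of the schema if they do not need to be searched.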
Mar 26, 2023 (8 months ago)
Matt
06:57 PM https://gpt-index.readthedocs.io/en/latest/reference/indices/vector_store.html#gpt_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex
This is the class that was used:
The GPTSimpleVectorIndex is a data structure where nodes are keyed by embeddings, and those embeddings are stored within a simple dictionary. During index construction, the document texts are chunked up, converted to nodes with text; they are then encoded in document embeddings stored within the dict.
So the JSON file is split into two parts. For Typesense I'm trying to get to a simple array structure of "text" and "vector".
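Getting from a two-part file to a flat array of "text" and "vector" records could be sketched as a join on a shared ID. The exact layout of the saved index is not shown in this thread, so the two input dictionaries below (nodes keyed by ID, embeddings keyed by the same ID) are a hypothetical shape, and the function and variable names are illustrative only.

```python
def join_text_and_vectors(nodes_by_id, embeddings_by_id):
    """Pair each node's text with its embedding, matched by a shared ID.

    Both arguments are hypothetical layouts: nodes_by_id maps an ID to a dict
    with a "text" key, embeddings_by_id maps the same ID to a list of floats.
    Nodes without a matching embedding are skipped.
    """
    records = []
    for node_id, node in nodes_by_id.items():
        vector = embeddings_by_id.get(node_id)
        if vector is None:
            continue  # no embedding stored for this node
        records.append({"id": str(node_id), "text": node["text"], "vector": vector})
    return records


# Tiny made-up example of the two parts
nodes = {
    "n1": {"text": "First paragraph.", "doc_id": "d1"},
    "n2": {"text": "Second paragraph.", "doc_id": "d1"},
    "n3": {"text": "Paragraph with no vector.", "doc_id": "d2"},
}
embeddings = {"n1": [0.1, 0.2], "n2": [0.3, 0.4]}

flat_records = join_text_and_vectors(nodes, embeddings)
```

Each item in flat_records is then a self-contained document, so the list can be written out one record per line as JSONL for import.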