# community-help
m
Hey everyone, I have an interesting issue. I’ve got a massive JSON file with a load of embedded texts generated via the OpenAI API, but I’m not sure how to get the JSON into Typesense. Whenever I try to inspect the structure of the JSON my laptop freezes (the file is 170 MB). Any ideas on how to get around this?
k
👋 can you try doing "less filename.json" from the shell?
m
hmm, sorry no-coder alert, is this done via the terminal? or within the typesense platform?
k
From your terminal
Go to the directory where the file is stored and then run the less command. You can use 'cd dirname' to switch directories on the terminal.
m
nice, so now I’m here
Do I need to determine the schema of this json before I can load it into typesense?
if so, how?
k
Is your goal to index just embeddings (i.e. float values) or text as well?
m
hmm, I think both. I’m trying to build a QA bot on 200+ Substack articles. This JSON file is the result of a process that chunked the HTML articles into paragraphs and then passed them to the embeddings API
I’d need the text as well so that I can pass it to the ChatGPT API (I think)
k
Okay, I’ll explain the generic bits and you can choose what works best for your use case:
1. Scroll through the large JSON document via `less` and identify the field names that need to be indexed.
2. Then in Typesense you create a collection. Each collection has a schema which lists the fields to be indexed and their types (what you identified in the previous step).
3. Once you’ve created a collection, you can import the file into Typesense. For this you have to convert the JSON file to a JSONL file. The JSONL (JSON Lines) format contains one JSON document per line, instead of the array structure that JSON uses (which makes large files like this hard to ingest). To convert JSON to JSONL you can use the `jq` tool as described here: https://typesense.org/docs/0.24.0/api/documents.html#import-a-json-file
4. You’re now ready to import the file and then search on it. You can refer to the docs on how to do this.
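If jq isn’t an option, a short Python script can do the same JSON-to-JSONL conversion. This is a minimal sketch: it assumes the file’s top level is a JSON array of documents, and the file names here are made-up placeholders for your own.

```python
import json

def json_to_jsonl(src_path, dest_path):
    """Convert a top-level JSON array file into one JSON document per line."""
    with open(src_path, encoding="utf-8") as src:
        documents = json.load(src)  # loads the whole array into memory
    with open(dest_path, "w", encoding="utf-8") as dest:
        for doc in documents:
            dest.write(json.dumps(doc) + "\n")

# Tiny demo with made-up documents; swap in your real file paths.
with open("demo.json", "w", encoding="utf-8") as f:
    json.dump([{"text": "first chunk", "embedding": [0.1, 0.2]},
               {"text": "second chunk", "embedding": [0.3, 0.4]}], f)

json_to_jsonl("demo.json", "demo.jsonl")

with open("demo.jsonl", encoding="utf-8") as f:
    print(sum(1 for _ in f))  # prints 2: one line per document
```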
m
Awesome. I’ll have a bash and see how far I can get
thanks for the help
😂 i’ll be back
the JSON has different sections, so which ones do you suggest I take? My simple assumption is I’d need the “text” and then all the vectors associated with that text, but I’m not sure which keys: [ “child_indices”, “doc_hash”, “doc_id”, “embedding”, “extra_info”, “image”, “index”, “node_info”, “ref_doc_id”, “text” ]
k
embedding field is likely the vector values
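Assuming each item in that key list is a field on a single node object, a tiny Python sketch of keeping just the two fields you’d index (“text” and “embedding”) could look like this — the sample node values below are made up:

```python
import json

def slim_node(node):
    """Keep only the fields Typesense needs: the chunk text and its vector.
    The field names ("text", "embedding") come from the key list above."""
    return {"text": node["text"], "embedding": node["embedding"]}

# A made-up node with the key names from the message above.
node = {
    "child_indices": [], "doc_hash": "abc", "doc_id": "d1",
    "embedding": [0.1, 0.2, 0.3], "extra_info": None, "image": None,
    "index": 0, "node_info": {}, "ref_doc_id": "d1",
    "text": "a paragraph of article text",
}

print(json.dumps(slim_node(node)))  # one slimmed-down document
```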
g
@Matt Roberts did you get this sorted? 🤔
m
Nope, but it’s my knowledge that’s holding me back. https://gpt-index.readthedocs.io/en/latest/reference/indices/vector_store.html#gpt_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex — this is the class that was used.

The GPTSimpleVectorIndex is a data structure where nodes are keyed by embeddings, and those embeddings are stored within a simple dictionary. During index construction, the document texts are chunked up and converted to nodes with text; they are then encoded into document embeddings stored within the dict.

So the JSON file is split into two sections. For Typesense I’m trying to get to the simple array structure of “text” plus “vector”.
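For what it’s worth, once the two sections have been identified with less, joining them back into flat “text” + “vec” records could look roughly like this Python sketch. The section names (“texts”, “vectors”) and node ids here are invented placeholders — the actual key names in the GPTSimpleVectorIndex dump will differ, so check them in the file first:

```python
import json

def to_typesense_rows(dump):
    """Join two id-keyed sections of an index dump into flat records.
    "texts" and "vectors" are assumed section names, not the real ones."""
    texts = dump["texts"]      # node_id -> chunk text   (assumed layout)
    vectors = dump["vectors"]  # node_id -> embedding    (assumed layout)
    return [{"text": texts[node_id], "vec": vectors[node_id]}
            for node_id in texts]

# Made-up two-section dump mimicking the "split into two" shape described.
dump = {
    "texts": {"n1": "first paragraph", "n2": "second paragraph"},
    "vectors": {"n1": [0.1, 0.2], "n2": [0.3, 0.4]},
}

for row in to_typesense_rows(dump):
    print(json.dumps(row))  # one JSONL line per node, ready for import
```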