# community-help
m
Hey everyone, I have an interesting issue. I’ve got a massive JSON file with a load of embedded texts generated via the OpenAI API, but I’m not sure how to get the JSON into Typesense. Whenever I try to inspect the structure of the JSON my laptop freezes (the file is 170 MB). Any ideas on how to get around this?
k
👋 can you try doing "less filename.json" from the shell?
m
hmm, sorry no-coder alert, is this done via the terminal? or within the typesense platform?
k
From your terminal
Go to the directory where the file is stored and then run the less command. You can use 'cd dirname' to switch directories on the terminal.
m
nice, so now I’m here
Do I need to determine the schema of this json before I can load it into typesense?
if so, how?
k
Is your goal to index just embeddings (i.e. float values) or text as well?
m
hmm, I think both. I’m trying to build a QA bot on 200+ Substack articles. This JSON file is the result of a process that chunked the HTML articles into paragraphs and then passed them to the embeddings API
I’d need the text as well so that I can pass it to the ChatGPT API (I think)
k
Okay, I’ll explain the generic bits and you can choose what works best for your use case:
1. Scroll through the large JSON document via `less` and identify the field names that need to be indexed.
2. Then in Typesense you create a collection. Each collection has a schema which lists the fields to be indexed and their types (what you identified in the previous step).
3. Once you’ve created a collection, you can import the file into Typesense. For this you have to convert the JSON file to a JSONL file. The JSONL (JSON Lines) format contains one JSON document per line, instead of the array structure that JSON uses (which makes large files like this hard to ingest). To convert JSON to JSONL you can use the `jq` tool as described here: https://typesense.org/docs/0.24.0/api/documents.html#import-a-json-file
4. You’re now ready to import the file and then search on it. You can refer to the docs on how to do this.
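If jq isn’t an option, a short Python script can do the same JSON-to-JSONL conversion. This is a minimal sketch: it assumes the file’s top level is a JSON array of documents, and the file names here are made-up placeholders for your own.

```python
import json

def json_to_jsonl(src_path, dest_path):
    """Convert a top-level JSON array file into one JSON document per line."""
    with open(src_path, encoding="utf-8") as src:
        documents = json.load(src)  # loads the whole array into memory
    with open(dest_path, "w", encoding="utf-8") as dest:
        for doc in documents:
            dest.write(json.dumps(doc) + "\n")

# Tiny demo with made-up documents; swap in your real file paths.
with open("demo.json", "w", encoding="utf-8") as f:
    json.dump([{"text": "first chunk", "embedding": [0.1, 0.2]},
               {"text": "second chunk", "embedding": [0.3, 0.4]}], f)

json_to_jsonl("demo.json", "demo.jsonl")

with open("demo.jsonl", encoding="utf-8") as f:
    print(sum(1 for _ in f))  # prints 2: one line per document
```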
m
Awesome. I’ll have a bash and see how far I can get
thanks for the help
😂 i’ll be back
the JSON has different sections, so which ones do you suggest I take? My simple assumption is I’d need the “text” and then all the vectors associated with that text, but I’m not sure which keys: [ “child_indices”, “doc_hash”, “doc_id”, “embedding”, “extra_info”, “image”, “index”, “node_info”, “ref_doc_id”, “text” ]
k
embedding field is likely the vector values
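Assuming each item in that key list is a field on a single node object, a tiny Python sketch of keeping just the two fields you’d index (“text” and “embedding”) could look like this — the sample node values below are made up:

```python
import json

def slim_node(node):
    """Keep only the fields Typesense needs: the chunk text and its vector.
    The field names ("text", "embedding") come from the key list above."""
    return {"text": node["text"], "embedding": node["embedding"]}

# A made-up node with the key names from the message above.
node = {
    "child_indices": [], "doc_hash": "abc", "doc_id": "d1",
    "embedding": [0.1, 0.2, 0.3], "extra_info": None, "image": None,
    "index": 0, "node_info": {}, "ref_doc_id": "d1",
    "text": "a paragraph of article text",
}

print(json.dumps(slim_node(node)))  # one slimmed-down document
```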
g
@Matt Roberts did you get this sorted? 🤔
m
Nope, but it’s my knowledge that’s holding me back. https://gpt-index.readthedocs.io/en/latest/reference/indices/vector_store.html#gpt_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex — this is the class that was used.

The GPTSimpleVectorIndex is a data structure where nodes are keyed by embeddings, and those embeddings are stored within a simple dictionary. During index construction, the document texts are chunked up and converted to nodes with text; they are then encoded into document embeddings stored within the dict.

So the JSON file is split into two sections. For Typesense I’m trying to get to the simple array structure of “text” plus “vector”.
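For what it’s worth, once the two sections have been identified with less, joining them back into flat “text” + “vec” records could look roughly like this Python sketch. The section names (“texts”, “vectors”) and node ids here are invented placeholders — the actual key names in the GPTSimpleVectorIndex dump will differ, so check them in the file first:

```python
import json

def to_typesense_rows(dump):
    """Join two id-keyed sections of an index dump into flat records.
    "texts" and "vectors" are assumed section names, not the real ones."""
    texts = dump["texts"]      # node_id -> chunk text   (assumed layout)
    vectors = dump["vectors"]  # node_id -> embedding    (assumed layout)
    return [{"text": texts[node_id], "vec": vectors[node_id]}
            for node_id in texts]

# Made-up two-section dump mimicking the "split into two" shape described.
dump = {
    "texts": {"n1": "first paragraph", "n2": "second paragraph"},
    "vectors": {"n1": [0.1, 0.2], "n2": [0.3, 0.4]},
}

for row in to_typesense_rows(dump):
    print(json.dumps(row))  # one JSONL line per node, ready for import
```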