#community-help

Finding Similar Documents Using JSON and Embeddings

TLDR Manish wants to find similar JSON documents and asks for advice. Jason suggests using Sentence-BERT with vector query and provides guidance on working with OpenAI embeddings and Typesense. They discuss upcoming Typesense features and alternative models.

Solved
May 09, 2023
Manish
06:03 PM
Hey guys, I want to figure out similar docs, given a JSON doc. What's a good way to achieve that?
06:04
Manish
06:04 PM
Do I need to run a Sentence-BERT model as mentioned here: https://typesense.org/docs/0.24.0/api/vector-search.html#what-is-an-embedding ?
Jason
06:08 PM
Yup, Sentence-BERT + vector query would be the way to go
06:09
Jason
06:09 PM
Or of course you could also use embeddings from OpenAI for eg, but that will be more expensive
Manish
06:09 PM
Interesting. How do I get embeddings from OpenAI?
Manish
06:10 PM
Interesting. For the OpenAI route, do you reckon it would take JSON directly, or do I need to push it all as a single long text string?
Jason
06:11 PM
I’d recommend pushing one long string into OpenAI / S-Bert

06:12
Jason
06:12 PM
*one long string per record
06:12
Jason
06:12 PM
For eg, notice how I concatenate product description, name, category and brand into one string before sending it to S-BERT for each product record here: https://github.com/typesense/showcase-ecommerce-store/blob/33e604a75b81f1de258ff185c775288b27db8335/scripts/vector-generation/main.py#L9
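A minimal sketch of that approach, assuming the sentence-transformers package; the field names and model name below are illustrative, not taken from the linked script:

# Concatenate a record's text fields into one string, then embed it with S-BERT.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence-BERT model works

doc = {
    "name": "Acme trail shoe",
    "description": "Lightweight running shoe with a cushioned sole",
    "categories": ["Shoes", "Running"],
    "brand": "Acme",
}

# Keep the field values, drop the JSON keys and punctuation.
text = " ".join([
    doc["name"],
    doc["description"],
    " ".join(doc["categories"]),
    doc["brand"],
])

embedding = model.encode(text).tolist()  # list of floats, length = the model's output dimension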
Manish
06:13 PM
I see. If I were to marshal the JSON into a single string, do you reckon it's the same thing? Or might that not work?
06:14
Manish
06:14 PM
{"key": "value", "key2": "value2", ...} style
Jason
06:20 PM
I’m not too sure, but I suspect the added punctuation might be noise and might eat into your token length?
Manish
06:21 PM
interesting. So, basically it's like a "bag of words" sort of approach.
Jason
06:21 PM
Right, that’s how I’ve been doing it
06:24
Jason
06:24 PM
This suggests that punctuation doesn’t affect BERT: https://aclanthology.org/2020.pam-1.15.pdf
06:24
Jason
06:24 PM
Not sure about S-BERT and OpenAI
06:25
Jason
06:25 PM
But the fact that at least some models are punctuation-sensitive might make a case for not adding JSON-like punctuation into the mix
06:25
Jason
06:25 PM
since the original training data might not have too many JSON representations
Manish
06:26 PM
Right, makes sense. Thinking I'll have one line per chat message, all separated by newlines.

Jason
06:27 PM
Another thing to experiment with, given that usually longer pieces of text do better with semantic search:

You could concatenate all messages from a single thread into one long string and generate embeddings for the full discussion thread…
Manish
06:27 PM
I suppose there's no way to tell what's a "heading" etc.
Jason
06:28 PM
Correct, and that doesn’t usually matter when generating embeddings, because it’s all text

Manish
06:29 PM
And once I have embeddings, I just stick them into the doc I send to Typesense.
Jason
06:30 PM
Yup
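A sketch of that indexing step with the Typesense Python client; the collection name, field names, and connection details here are made up for illustration:

import typesense

client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 5,
})

client.collections["threads"].documents.upsert({
    "id": "thread-123",
    "text": text,            # the concatenated string from the earlier sketch
    "embedding": embedding,  # float[] from whichever model you used; must match the schema's num_dim
})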
Manish
06:31 PM
Found this in OpenAI code:
06:31
Manish
06:31 PM
// OpenAI suggests replacing newlines (\n) in your input with a single space, as they
// have observed inferior results when newlines are present.
// E.g.
// "The food was delicious and the waiter..."
06:31
Manish
06:31 PM
So, yeah, no newlines then 🙂
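A sketch of that step, using the pre-1.0 openai Python package (newer versions expose the same call as OpenAI().embeddings.create); the API key is a placeholder:

import openai

openai.api_key = "sk-..."  # placeholder

def embed(text: str) -> list[float]:
    # Per the note above, replace newlines with spaces before sending to OpenAI.
    flattened = text.replace("\n", " ")
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=flattened)
    return resp["data"][0]["embedding"]  # 1536 floats for text-embedding-ada-002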
Jason
06:32 PM
I see, good to know!
Manish
06:43 PM
Wow, I'm seeing 1500 floats come out of the OpenAI embedding system.
06:43
Manish
06:43 PM
for a single thread. Is that something Typesense can work with?
Jason
06:44 PM
Yup, you’d have to set the num_dim property in the field definition in the Typesense collection schema to the number of floats you see
06:44
Jason
06:44 PM
Think it’s 1536
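So the collection schema needs a float[] field with num_dim matching the model's output (1536 for text-embedding-ada-002). A sketch, reusing the client from the earlier snippet; collection and field names are illustrative:

schema = {
    "name": "threads",
    "fields": [
        {"name": "text", "type": "string"},
        {"name": "embedding", "type": "float[]", "num_dim": 1536},
    ],
}
client.collections.create(schema)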
Manish
06:45 PM
Could multiple docs have a different number of floats? Yup, it's 1536
Jason
06:45 PM
No, a particular model will always generate the same number of floats for all types of input
Manish
06:45 PM
ahh.. interesting
06:46
Manish
06:46 PM
You're right. The OpenAI site shows: "text-embedding-ada-002 | cl100k_base | 8191 max input tokens | 1536 output dimensions"

06:48
Manish
06:48 PM
Does Typesense use these for search as well as similarity?
Jason
06:48 PM
Typesense uses these any time you use the vector_query parameter… which you’d use for similarity search and semantic search
Manish
06:52 PM
Got it. So if I have a search term, I need to generate its embeddings first as well, to do a similarity search. OTOH, if I want similar docs, then I just give the doc id to TS.
Jason
06:52 PM
Yup exactly
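A sketch of both flavours of vector_query, reusing the client and the embed() helper from the earlier snippets; collection and field names are illustrative:

# "More like this": nearest neighbours of an already-indexed document, by id.
similar = client.collections["threads"].documents.search({
    "q": "*",
    "vector_query": "embedding:([], id: thread-123)",
})

# Semantic search: embed the query text first, then search with that vector.
# Long vectors are better sent via multi_search, which uses POST.
query_vec = embed("how do I find similar documents?")
semantic = client.multi_search.perform({
    "searches": [{
        "collection": "threads",
        "q": "*",
        "vector_query": "embedding:([" + ", ".join(str(v) for v in query_vec) + "], k: 10)",
    }]
}, {})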
Manish
06:53 PM
Hmm... generating an embedding per search term might be expensive with OpenAI. So it's probably more useful for doc similarity.
Jason
06:53 PM
In the upcoming version of Typesense, we’re adding a way to automatically generate embeddings from within Typesense, so you don’t have to do that extra step both during indexing and search time. And you can then combine regular keyword search with semantic search in a single step

Manish
06:53 PM
That'd save a bunch of trouble.
Jason
06:54 PM
Yeah for sure! We’re also adding an integration with OpenAI, so you can use their embedding models from within Typesense
Manish
06:54 PM
But I suppose OpenAI is better than other models that are available? So there might be a quality difference?
Jason
06:54 PM
Word on the street is that OpenAI embeddings model quality is not that great… which is surprising, but it might be use-case dependent
Manish
06:55 PM
interesting.
06:55
Manish
06:55 PM
which one do you recommend?
Jason
06:57 PM
S-Bert is a good general purpose model. But I heard that Microsoft’s E5 Text Embedding Model is beating a lot of benchmarks…
06:57
Jason
06:57 PM
Trying to find the tweet where I read these benchmarks…
Manish
06:57 PM
says "instructor-xl" is the top
Jason
06:58 PM
Interesting, haven’t heard of it yet
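If you want to try E5 locally, a minimal sketch via sentence-transformers; per the E5 model cards on Hugging Face, inputs should be prefixed with "query: " or "passage: ":

from sentence_transformers import SentenceTransformer

e5 = SentenceTransformer("intfloat/e5-base-v2")  # E5 checkpoint on Hugging Face

doc_vec = e5.encode("passage: " + text)
query_vec = e5.encode("query: how do I find similar documents?")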
Manish
07:09 PM
> We’re also adding an integration with OpenAI,
How would that work? For each search request, you'd have to send a call to the OpenAI APIs?
Jason
07:10 PM
Correct
07:10
Jason
07:10 PM
We’re going to only enable it for non-prefix searches, meaning we’ll disable it for search-as-you-type use-cases, to avoid running up OpenAI bills
07:11
Jason
07:11 PM
Plus, when you index a document, you’ll indicate which fields in the JSON document to concatenate and send over to OpenAI to generate embeddings with
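For reference, this is roughly what that looks like in the collection schema once the auto-embedding feature shipped (Typesense 0.25+): an embed property names the source fields and the model. The values below are illustrative:

auto_schema = {
    "name": "threads_auto",
    "fields": [
        {"name": "text", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["text"],  # fields Typesense concatenates before embedding
                "model_config": {
                    "model_name": "openai/text-embedding-ada-002",
                    "api_key": "sk-...",  # placeholder
                    # or use "model_name": "ts/e5-small" (no api_key) to run the
                    # built-in E5 model without external API calls
                },
            },
        },
    ],
}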
Manish
07:11 PM
Yup -- basically what I'm doing right now.
07:12
Manish
07:12 PM
That integration would be great. And technically, you could even run your own E5 servers.
Jason
07:12 PM
Yup! We’re also going to ship E5 with Typesense, so if you don’t want to make external API calls you can use that model natively
Manish
07:13 PM
nice.