# community-help
m
Hey guys, I want to figure out similar docs, given a JSON doc. What's a good way to achieve that?
Do I need to run a Sentence-BERT model as mentioned here: https://typesense.org/docs/0.24.0/api/vector-search.html#what-is-an-embedding ?
j
Yup, Sentence-BERT + vector query would be the way to go
Or of course you could also use embeddings from OpenAI, for example, but that will be more expensive
m
Interesting. How do I get embeddings from OpenAI?
m
Interesting. For the OpenAI route, do you reckon it would take JSON directly, or do I need to push it all as a single long text string?
j
I’d recommend pushing one long string into OpenAI / S-Bert
👍 1
*one long string per record
For eg, notice how I concatenate product description, name, category and brand into one string before sending it to S-BERT for each product record here: https://github.com/typesense/showcase-ecommerce-store/blob/33e604a75b81f1de258ff185c775288b27db8335/scripts/vector-generation/main.py#L9
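For reference, a minimal sketch of that concatenation step (the field names and model name here are illustrative, not taken from the linked script):

```python
# Sketch: flatten the semantically meaningful fields of a product
# record into one string before embedding. Field names are illustrative.
def record_to_text(record, fields=("name", "description", "category", "brand")):
    return " ".join(str(record[f]) for f in fields if record.get(f))

product = {
    "name": "Trail Runner 2",
    "description": "Lightweight running shoe",
    "category": "Footwear",
    "brand": "Acme",
}

text = record_to_text(product)
# → "Trail Runner 2 Lightweight running shoe Footwear Acme"

# With sentence-transformers you would then do something like:
#   from sentence_transformers import SentenceTransformer
#   embedding = SentenceTransformer("all-MiniLM-L6-v2").encode(text)
```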
m
I see. If I were to marshal JSON into a single string, do you reckon it's the same thing? Or might that not work?
{"key": "value", "key2": "value2", ...} style
j
I’m not too sure, but I suspect the added punctuation might be noise and might eat into your token length?
m
interesting. So, basically it's like a "bag of words" sort of approach.
j
Right, that’s how I’ve been doing it
This suggests that punctuation doesn’t affect BERT: https://aclanthology.org/2020.pam-1.15.pdf
Not sure about S-BERT and OpenAI
But the fact that at least some models are punctuation-sensitive might make a case for not adding JSON-like punctuation into the mix
since the original training data might not have too many JSON representations
m
Right, makes sense. Thinking I'll have one line per chat message, all separated by newlines.
👍 1
j
Another thing to experiment with, given that usually longer pieces of text do better with semantic search: You could concatenate all messages from a single thread into one long string and generate embeddings for the full discussion thread…
m
I suppose there's no way to tell what's a "heading" etc.
j
Correct, and that doesn’t usually matter when generating embeddings, because it’s all text
👍 1
m
And once I have embeddings, I just stick them into the doc I send to Typesense.
j
Yup
m
Found this in OpenAI code:
// OpenAI suggests replacing newlines (\n) in your input with a single space,
// as they have observed inferior results when newlines are present.
// E.g. "The food was delicious and the waiter..."
So, yeah, no newlines then 🙂
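A small sketch of that preprocessing: join the per-message lines with spaces instead of newlines before calling the embeddings API. The commented-out call assumes the `openai` package and the `text-embedding-ada-002` model:

```python
# Sketch: collapse a chat thread into one newline-free string, since
# OpenAI reports inferior embedding results when newlines are present.
def thread_to_text(messages):
    joined = " ".join(messages)
    # Replace newlines with spaces and squeeze repeated whitespace.
    return " ".join(joined.replace("\n", " ").split())

thread = ["I want to find similar docs.\n", "Do I need Sentence-BERT?"]
text = thread_to_text(thread)
# → "I want to find similar docs. Do I need Sentence-BERT?"

# The actual API call (not run here; needs an API key) would be roughly:
#   from openai import OpenAI
#   vec = OpenAI().embeddings.create(model="text-embedding-ada-002",
#                                    input=text).data[0].embedding
```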
j
I see, good to know!
m
Wow, I'm seeing 1500 floats come out of the OpenAI embedding system.
For a single thread. Is that something Typesense can work with?
j
Yup, you’d have to set the `num_dim` property in the field definition in the Typesense collection schema to the number of floats you see
Think it’s 1536
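A sketch of what that schema field might look like (the collection and field names are made up; only `num_dim` has to match the model's output size):

```python
# Sketch of a Typesense collection schema with a vector field.
# num_dim must equal the embedding model's output size
# (1536 for OpenAI's text-embedding-ada-002).
schema = {
    "name": "threads",
    "fields": [
        {"name": "text", "type": "string"},
        {"name": "embedding", "type": "float[]", "num_dim": 1536},
    ],
}

# With the typesense-python client you would then run:
#   client.collections.create(schema)
```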
m
Yup, it's 1536. Could multiple docs have a different number of floats?
j
No, a particular model will always generate the same number of floats for all types of input
m
ahh.. interesting
you're right. OpenAI's site shows this: "text-embedding-ada-002 / cl100k_base / 8191 max input tokens / 1536 output dimensions"
👍 1
Does Typesense use these for search as well as similarity?
j
Typesense uses these any time you use the `vector_query` parameter… which you’d use for similarity search and semantic search
m
Got it. So if I have a search term, I need to generate its embedding first as well to do a similarity search. OTOH, if I want similar docs, then I just give the doc id to Typesense.
j
Yup exactly
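As a sketch of both modes (collection/field names and the doc id are made up, and a truncated 3-float vector stands in for the real 1536):

```python
# Similarity search with an explicit query vector (normally the 1536
# floats from the embedding model; truncated here for readability).
query_vector = [0.12, -0.34, 0.56]
by_vector = {
    "q": "*",
    "vector_query": f"embedding:({query_vector}, k:10)",
}

# "More like this" by document id: pass an empty vector plus the id,
# and Typesense looks up that document's stored embedding itself.
by_doc_id = {
    "q": "*",
    "vector_query": "embedding:([], id: doc-123)",
}
```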
m
hmm... generating an embedding per search term might be expensive with OpenAI. So probably more useful for doc similarity.
j
In the upcoming version of Typesense, we’re adding a way to automatically generate embeddings from within Typesense, so you don’t have to do that extra step both during indexing and search time. And you can then combine regular keyword search with semantic search in a single step
🤘 1
👍 2
🎉 1
m
That'd save a bunch of trouble.
j
Yeah for sure! We’re also adding an integration with OpenAI, so you can use their embedding models from within Typesense
m
But I suppose OpenAI is better than the other models that are available? So there might be a quality difference?
j
Word on the street is that OpenAI embeddings model quality is not that great… which is surprising, but it might be use-case dependent
m
interesting.
which one do you recommend?
j
S-Bert is a good general purpose model. But I heard that Microsoft’s E5 Text Embedding Model is beating a lot of benchmarks…
Trying to find the tweet where I read these benchmarks…
m
says "instructor-xl" is the top
j
Interesting, haven’t heard of it yet
m
> We’re also adding an integration with OpenAI
How would that work? For each search request, you'd have to send a call to openAI apis?
j
Correct
We’re going to only enable it for non-prefix searches, meaning we’ll disable it for search-as-you-type use-cases, to avoid running up OpenAI bills
Plus, when you index a document, you’ll indicate which fields in the JSON document to concatenate and send over to OpenAI to generate embeddings with
m
yup -- basically, what I'm doing right now.
That integration would be great. And technically, you could even run your own E5 servers.
j
Yup! We’re also going to ship E5 with Typesense, so if you don’t want to make external API calls you can use that model natively
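For context, the auto-embedding schema that this describes later shipped in Typesense 0.25+ and looks roughly like the sketch below; treat the exact model name as an assumption:

```python
# Sketch of an auto-embedding field: Typesense concatenates the listed
# fields and generates the embedding itself (Typesense 0.25+ syntax;
# the built-in E5 model name is an assumption).
schema = {
    "name": "products",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "description", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["name", "description"],
                "model_config": {"model_name": "ts/e5-small"},
            },
        },
    ],
}
```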
m
nice.
j
Great set of questions! Could you post this as a new thread, so it’s easier to find for future searchers?
w
yes let me do that now
👍 1
j
Hi @Jason Bosco, you mentioned "combine regular keyword search with semantic search" above. I wonder how Typesense currently combines the returned results from both sides
j
Thanks Jason! Looks like keyword search is more important in determining the final rank!
j
Yup!