# community-help
m
Hey guys, I want to figure out similar docs, given a JSON doc. What's a good way to achieve that?
Do I need to run a Sentence-BERT model as mentioned here: https://typesense.org/docs/0.24.0/api/vector-search.html#what-is-an-embedding ?
j
Yup, Sentence-BERT + vector query would be the way to go
Or of course you could also use embeddings from OpenAI, for example, but that will be more expensive
m
Interesting. How do I get embeddings from OpenAI?
m
Interesting. For the OpenAI route, do you reckon it would take JSON directly, or do I need to push it all as a single long text string?
j
I’d recommend pushing one long string into OpenAI / S-Bert
👍 1
*one long string per record
For eg, notice how I concatenate product description, name, category and brand into one string before sending it to S-BERT for each product record here: https://github.com/typesense/showcase-ecommerce-store/blob/33e604a75b81f1de258ff185c775288b27db8335/scripts/vector-generation/main.py#L9
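For reference, a minimal sketch of that concatenation step (the field names and model name here are illustrative, not taken from the linked script):

```python
# Sketch: flatten the semantically meaningful fields of a product
# record into one string before embedding. Field names are illustrative.
def record_to_text(record, fields=("name", "description", "category", "brand")):
    return " ".join(str(record[f]) for f in fields if record.get(f))

product = {
    "name": "Trail Runner 2",
    "description": "Lightweight running shoe",
    "category": "Footwear",
    "brand": "Acme",
}

text = record_to_text(product)
# → "Trail Runner 2 Lightweight running shoe Footwear Acme"

# With sentence-transformers you would then do something like:
#   from sentence_transformers import SentenceTransformer
#   embedding = SentenceTransformer("all-MiniLM-L6-v2").encode(text)
```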
m
I see. If I were to marshal JSON into a single string, do you reckon it's the same thing? Or might that not work?
{"key": "value", "key2": "value2", ...} style
j
I’m not too sure, but I suspect the added punctuation might be noise and might eat into your token length?
m
interesting. So, basically it's like a "bag of words" sort of approach.
j
Right, that’s how I’ve been doing it
This suggests that punctuation doesn’t affect BERT: https://aclanthology.org/2020.pam-1.15.pdf
Not sure about S-BERT and OpenAI
But the fact that at least some models are punctuation-sensitive might make a case for not adding JSON-like punctuation into the mix
since the original training data might not have too many JSON representations
m
Right, makes sense. Thinking I'll have one line per chat message, all separated by newlines.
👍 1
j
Another thing to experiment with, given that usually longer pieces of text do better with semantic search: You could concatenate all messages from a single thread into one long string and generate embeddings for the full discussion thread…
m
I suppose there's no way to tell what's a "heading" etc.
j
Correct, and that doesn’t usually matter when generating embeddings, because it’s all text
👍 1
m
And once I have embeddings, I just stick them into the doc I send to Typesense.
j
Yup
m
Found this in OpenAI code:
// OpenAI suggests replacing newlines (\n) in your input with a single space,
// as they have observed inferior results when newlines are present.
// E.g. "The food was delicious and the waiter..."
So, yeah, no newlines then 🙂
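A small sketch of that preprocessing: join the per-message lines with spaces instead of newlines before calling the embeddings API. The commented-out call assumes the `openai` package and the `text-embedding-ada-002` model:

```python
# Sketch: collapse a chat thread into one newline-free string, since
# OpenAI reports inferior embedding results when newlines are present.
def thread_to_text(messages):
    joined = " ".join(messages)
    # Replace newlines with spaces and squeeze repeated whitespace.
    return " ".join(joined.replace("\n", " ").split())

thread = ["I want to find similar docs.\n", "Do I need Sentence-BERT?"]
text = thread_to_text(thread)
# → "I want to find similar docs. Do I need Sentence-BERT?"

# The actual API call (not run here; needs an API key) would be roughly:
#   from openai import OpenAI
#   vec = OpenAI().embeddings.create(model="text-embedding-ada-002",
#                                    input=text).data[0].embedding
```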
j
I see, good to know!
m
Wow, I'm seeing 1500 floats come out of the OpenAI embedding system.
For a single thread. Is that something Typesense can work with?
j
Yup, you’d have to set the `num_dim` property in the field definition in the Typesense collection schema to the number of floats you see
Think it’s 1536
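A sketch of what that schema field might look like (the collection and field names are made up; only `num_dim` has to match the model's output size):

```python
# Sketch of a Typesense collection schema with a vector field.
# num_dim must equal the embedding model's output size
# (1536 for OpenAI's text-embedding-ada-002).
schema = {
    "name": "threads",
    "fields": [
        {"name": "text", "type": "string"},
        {"name": "embedding", "type": "float[]", "num_dim": 1536},
    ],
}

# With the typesense-python client you would then run:
#   client.collections.create(schema)
```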
m
Yup, it's 1536. Could multiple docs have a different number of floats?
j
No, a particular model will always generate the same number of floats for all types of input
m
ahh.. interesting
you're right. OpenAI's site shows this: "text-embedding-ada-002 / cl100k_base / 8191 max input tokens / 1536 output dimensions"
👍 1
Does Typesense use these for search as well as similarity?
j
Typesense uses these any time you use the `vector_query` parameter… which you’d use for similarity search and semantic search
m
Got it. So if I have a search term, I need to generate its embedding first as well to do a similarity search. OTOH, if I want similar docs, then I just give the doc id to Typesense.
j
Yup exactly
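As a sketch of both modes (collection/field names and the doc id are made up, and a truncated 3-float vector stands in for the real 1536):

```python
# Similarity search with an explicit query vector (normally the 1536
# floats from the embedding model; truncated here for readability).
query_vector = [0.12, -0.34, 0.56]
by_vector = {
    "q": "*",
    "vector_query": f"embedding:({query_vector}, k:10)",
}

# "More like this" by document id: pass an empty vector plus the id,
# and Typesense looks up that document's stored embedding itself.
by_doc_id = {
    "q": "*",
    "vector_query": "embedding:([], id: doc-123)",
}
```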
m
hmm... generating an embedding per search term might be expensive with OpenAI. So probably more useful for doc similarity.
j
In the upcoming version of Typesense, we’re adding a way to automatically generate embeddings from within Typesense, so you don’t have to do that extra step both during indexing and search time. And you can then combine regular keyword search with semantic search in a single step
🤘 1
👍 2
🎉 1
m
That'd save a bunch of trouble.
j
Yeah for sure! We’re also adding an integration with OpenAI, so you can use their embedding models from within Typesense
m
But I suppose OpenAI is better than the other models that are available? So there might be a quality difference?
j
Word on the street is that OpenAI embeddings model quality is not that great… which is surprising, but it might be use-case dependent
m
interesting.
which one do you recommend?
j
S-Bert is a good general purpose model. But I heard that Microsoft’s E5 Text Embedding Model is beating a lot of benchmarks…
Trying to find the tweet where I read these benchmarks…
m
says "instructor-xl" is the top
j
Interesting, haven’t heard of it yet
m
> We’re also adding an integration with OpenAI
How would that work? For each search request, you'd have to send a call to openAI apis?
j
Correct
We’re going to only enable it for non-prefix searches, meaning we’ll disable it for search-as-you-type use-cases, to avoid running up OpenAI bills
Plus, when you index a document, you’ll indicate which fields in the JSON document to concatenate and send over to OpenAI to generate embeddings with
m
yup -- basically, what I'm doing right now.
That integration would be great. And technically, you could even run your own E5 servers.
j
Yup! We’re also going to ship E5 with Typesense, so if you don’t want to make external API calls you can use that model natively
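For context, the auto-embedding schema that this describes later shipped in Typesense 0.25+ and looks roughly like the sketch below; treat the exact model name as an assumption:

```python
# Sketch of an auto-embedding field: Typesense concatenates the listed
# fields and generates the embedding itself (Typesense 0.25+ syntax;
# the built-in E5 model name is an assumption).
schema = {
    "name": "products",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "description", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["name", "description"],
                "model_config": {"model_name": "ts/e5-small"},
            },
        },
    ],
}
```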
m
nice.
j
Great set of questions! Could you post this as a new thread, so it’s easier to find for future searchers?
w
yes let me do that now
👍 1
j
Hi @Jason Bosco, you mentioned "combine regular keyword search with semantic search" above. I wonder how Typesense currently combines the returned results from both sides
j
Thanks Jason! Looks like keyword search is more important in determining the final rank!
j
Yup!