Hey there I am looking at integrating a semantic search func typesense #community-help

Krish

01/12/2023, 4:57 AM

Hey there, I am looking at integrating semantic search functionality with Typesense for the MVP I am working on. I have it coded in Python and it works in a standalone environment. Since Typesense is written in C++, I'm not sure how to integrate my existing Python code with it to make use of its in-memory indexing. Could you share some thoughts and/or pointers?

Kishore Nallan

01/12/2023, 5:06 AM

Use the Python client to integrate with Typesense server.

Krish

01/12/2023, 5:45 AM

OK, let me try to explain what I am trying to do here. I am trying to implement semantic search using cosine similarity, which requires me to take the search query, compute its cosine similarity with all the docs, and return the highest-ranking ones. Currently I have a small Python script that does this against a JSON file for a search query. I already have a Typesense instance running. My client is in Node.js, and soon I will be updating the existing docs in the Typesense server with the embeddings. What I am not sure of is this: on a search request, how do I make my client execute the Python script on the docs and return the data from within the Typesense server? I could still do this by grabbing all the docs from Typesense and then running the cosine similarity on each to find the best, but it would be inefficient. I hope that clarifies my requirement.
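For context, the brute-force approach described in this message could look roughly like the sketch below. It is not Krish's actual script; the `id` and `embedding` field names and the tiny two-dimensional vectors are made up for illustration. Every document is scored against the query, which is why the cost grows with the size of the collection.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, docs, top_n=3):
    """Score every doc against the query and return the top_n (score, id) pairs."""
    scored = [(cosine_similarity(query_vec, d["embedding"]), d["id"]) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical documents standing in for the contents of the JSON file.
docs = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [1.0, 1.0]},
]
ranked = rank_by_similarity([1.0, 0.0], docs, top_n=2)
print(ranked)
```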

Kishore Nallan

01/12/2023, 6:09 AM

You can't run a Python script on the docs within Typesense directly. Typesense can already do nearest-neighbour cosine similarity search on embeddings in the 0.24 RC builds.

Krish

01/12/2023, 6:13 AM

I am on 0.23.1, so are you recommending an upgrade? And if I use this functionality, do I have to call out the fields I need the similarity calculated for using the query_by option?

Kishore Nallan

01/12/2023, 6:22 AM

Check usage here: https://github.com/typesense/typesense-website/blob/v0.24.0/docs-site/content/0.24.0/api/vector-search.md#vector-search

Krish

01/12/2023, 6:51 AM

Thank you. Questions: 1. Does the field num_dim have any bearing? In my case I cannot say for sure what this number would be for each document. 2. What is the default sorting criterion in the schema? Does it tell Typesense to sort the documents on the specified field, so that returned data is sorted not only by nearest cosine similarity but also by this field? 3. In the search parameters, is it necessary for the client to send the query vector?

Kishore Nallan

01/12/2023, 6:54 AM

1. For cosine distance, the number of dimensions needs to be known upfront, so `num_dim` must be common to all the docs. 2. Currently it is only possible to sort the results on distance from the query vector. 3. You can send either a query vector or a document ID whose field value should be used as the reference query vector.
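Because `num_dim` has to be the same for every document, it can help to check vector lengths before importing. The sketch below assumes the schema shape described on the linked vector search page (a `float[]` field carrying `num_dim`). The collection name `semantic`, the field name `vector`, the value 1536, and the `check_dimensions` helper are illustrative, not part of Typesense.

```python
def build_vector_schema(collection, field, num_dim):
    """Collection schema with one float[] field of fixed dimensionality."""
    return {
        "name": collection,
        "fields": [{"name": field, "type": "float[]", "num_dim": num_dim}],
    }


def check_dimensions(schema, document):
    """Raise ValueError if a document's vector length differs from num_dim."""
    for field in schema["fields"]:
        if "num_dim" not in field:
            continue
        actual = len(document[field["name"]])
        if actual != field["num_dim"]:
            raise ValueError(
                f"{field['name']} has {actual} dimensions, expected {field['num_dim']}"
            )
    return True


schema = build_vector_schema("semantic", "vector", 1536)
good_doc = {"id": "1", "vector": [0.0] * 1536}
print(check_dimensions(schema, good_doc))
```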

Krish

01/12/2023, 7:34 AM

OK. Let me play with the implementation a bit. Thank you Kishore.

👍 1

Krish

01/12/2023, 9:53 AM

Do you all have the 0.24 RC Docker image available yet?

Kishore Nallan

01/12/2023, 9:54 AM

https://hub.docker.com/r/typesense/typesense/tags

Krish

01/12/2023, 9:55 AM

Which one do I go for, rcn56?

Kishore Nallan

01/12/2023, 9:55 AM

Yes

👍 1

Krish

01/12/2023, 3:47 PM

Any idea why I have been getting a bad request while searching? Below is my curl command:

curl -g -H "X-TYPESENSE-API-KEY: <my key>" "http://localhost:8108/collections/semantic/documents/search?q=*&vector_query=vector:([-0.01622316800057888,0.0011516984086483717,-0.0028857849538326263,-0.011529190465807915,-0.0017157779075205326,-0.0015300762606784701,-0.013300766237080097,-0.03791450709104538,-0.011020036414265633,-0.01646030880510807,0.011229277588427067,-0.012414977885782719,-0.009652994573116302,-0.0007240617414936423,.....,0.005042713135480881,0.01222666073590517,-0.0012711402960121632,0.024090636521577835,-0.02943326346576214], k:100)"

Krish

01/12/2023, 3:47 PM

the vector in the request is truncated

Kishore Nallan

01/12/2023, 4:17 PM

Use multi search endpoint

Krish

01/12/2023, 4:20 PM

is it https://typesense.org/docs/0.23.1/api/federated-multi-search.html#multi-search-parameters?

Krish

01/13/2023, 5:14 AM

could you help with a sample query using the multi search endpoint please?

Kishore Nallan

01/13/2023, 5:17 AM

See the `shell` example in the link you have posted above. What problem are you facing?

Krish

01/13/2023, 5:19 AM

The curl command pasted above always fails with a "Bad request" error. I took a look at the shell example in the doc, but I am not sure how to use it for a vector query. For example, would the query_by field refer to the vector field?

Kishore Nallan

01/13/2023, 5:21 AM

Yes that won't work because the GET method imposes a length restriction, which is why I pointed to multi search

Kishore Nallan

01/13/2023, 5:22 AM

I will have to update the vector search readme to account for this. Use multi_search

Krish

01/13/2023, 5:23 AM

while you update, if you could send me some sample pointers for me to proceed with my experiment, it'd help

Kishore Nallan

01/13/2023, 5:23 AM

Vector query using multi search:

curl 'http://localhost:8108/multi_search?collection=docs' -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
--data-raw '{"searches":[{"q":"*", "vector_query": "vec:([0.96826,0.94,0.39557,0.306488])" }]}'
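The same request could be assembled from Python roughly as follows. This is a sketch using only the standard library, not the official Typesense Python client. The helper names are made up, and the `vector_query` string format simply mirrors the curl examples in this thread. `post_multi_search` is defined but not called here, since it needs a running server.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_vector_query(field, vector, k=None):
    """Format a vector_query string such as 'vec:([0.1,0.2], k:100)'."""
    values = ",".join(repr(float(v)) for v in vector)
    suffix = f", k:{k}" if k is not None else ""
    return f"{field}:([{values}]{suffix})"


def build_multi_search(base_url, collection, field, vector, k=None):
    """Return (url, json_body) for a single vector search sent via multi_search."""
    url = f"{base_url}/multi_search?{urlencode({'collection': collection})}"
    search = {"q": "*", "vector_query": build_vector_query(field, vector, k)}
    return url, json.dumps({"searches": [search]})


def post_multi_search(url, body, api_key):
    """POST the body; the vector travels in the payload, not the query string."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={"X-TYPESENSE-API-KEY": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


url, body = build_multi_search(
    "http://localhost:8108", "docs", "vec", [0.96826, 0.94, 0.39557, 0.306488]
)
print(url)
print(body)
```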

Kishore Nallan

01/13/2023, 5:24 AM

The 400 bad request should have returned an error like this:

Query string exceeds max allowed length of 4000. Use the /multi_search end-point for larger payloads.

Did it not?

Krish

01/13/2023, 5:25 AM

No, it did not... it only returned "bad request"

Kishore Nallan

01/13/2023, 5:25 AM

Using Python client or curl?

Krish

01/13/2023, 5:25 AM

message has been deleted

Krish

01/13/2023, 5:25 AM

curl.

Kishore Nallan

01/13/2023, 5:26 AM

Can you try doing curl with the "-v" parameter?

Krish

01/13/2023, 5:26 AM

sure, same query right?

Kishore Nallan

01/13/2023, 5:27 AM

Yes, what you are seeing on the screenshot is the http error code reason, not the actual returned body of the response.

Krish

01/13/2023, 5:27 AM

this is with -v

Kishore Nallan

01/13/2023, 5:28 AM

Hmm ok. Can you post the full curl request? You can anonymize the host, collection name / other identifiable params.

Krish

01/13/2023, 5:28 AM

sure

Krish

01/13/2023, 5:31 AM

Attached, the vector field is enormously lengthy. Hence, pasted it in the file.

search-vectorquery-curl.txt

Kishore Nallan

01/13/2023, 5:33 AM

Ok, let me try running locally

👍 1

Kishore Nallan

01/13/2023, 5:38 AM

I think pasting via the terminal is maybe messing up some stuff. When I paste the URL into a browser, I get a proper response. Can you try that?

Kishore Nallan

01/13/2023, 5:39 AM

In any case, using multi_search will get around this issue. I will update the example in readme.

Krish

01/13/2023, 5:39 AM

The multisearch query worked...

👍 1

Krish

01/13/2023, 5:40 AM

i'll try in the browser and revert

Krish

01/13/2023, 5:43 AM

Yes, in the browser it reported the appropriate error

{
  "message": "Query string exceeds max allowed length of 4000. Use the /multi_search end-point for larger payloads."
}
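A client could guard against this error by measuring the query string before choosing an endpoint. The sketch below takes the 4000 figure from the error message above. Whether the server counts characters before or after URL encoding is not stated in the thread, so the encoded length used here is only an approximation. The helper names and the placeholder vector values are hypothetical.

```python
from urllib.parse import urlencode

MAX_QUERY_STRING = 4000  # limit quoted in the server's error message


def query_string_length(params):
    """Length of the URL-encoded query string (approximates what the server counts)."""
    return len(urlencode(params))


def pick_endpoint(collection, params, limit=MAX_QUERY_STRING):
    """Return (method, path): GET search for short queries, POST multi_search otherwise."""
    if query_string_length(params) > limit:
        return "POST", "/multi_search"
    return "GET", f"/collections/{collection}/documents/search"


# A 1536-dimension vector (placeholder values) is far longer than the limit.
vector = [0.0123456789] * 1536
long_params = {
    "q": "*",
    "vector_query": "vector:([" + ",".join(repr(v) for v in vector) + "], k:100)",
}
print(query_string_length(long_params), pick_endpoint("semantic", long_params))
```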

Kishore Nallan

01/13/2023, 5:43 AM

Ok, yeah then it must be some terminal issue.

Krish

01/13/2023, 5:46 AM

Cool, thanks for the help. One other thing since you are around: the vector arrays for my records are massive (the length is around 1536 :)). When the search returns the hits, is there a way to return only a few fields from the records? (I am looking at omitting the vector field because it's annoying.)

Kishore Nallan

01/13/2023, 5:47 AM

Yes, check out the `include_fields` and `exclude_fields` search params. They control which fields are returned.
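As a small illustration, `exclude_fields` can be added to a search entry as a comma-separated list of field names. The `without_fields` helper and the field name `vector` are assumptions made for this sketch.

```python
def without_fields(search, fields):
    """Return a copy of a search dict with exclude_fields set to the given names."""
    updated = dict(search)
    updated["exclude_fields"] = ",".join(fields)
    return updated


search = {"q": "*", "vector_query": "vector:([0.1,0.2], k:100)"}
trimmed = without_fields(search, ["vector"])
print(trimmed)
```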

Krish

01/13/2023, 5:47 AM

awesome!

Krish

01/13/2023, 5:47 AM

thank you!

👍 1

Krish

01/18/2023, 6:41 AM

@Kishore Nallan - got a question around this implementation. Is there a way I could do a multi-search query on a subset of the collection ?

Krish

01/18/2023, 6:43 AM

Basically, what I am looking at doing is this: when the user first sends the query, it does the normal search (which in my case would be a keyword search). On this result I would then like to do the multi-search using vector_query. Is this possible?

Kishore Nallan

01/18/2023, 6:45 AM

This is not possible yet, but we plan to support this in future. For now vector search can only work on a regular filter_by subset, and not on the results of a keyword search.

Krish

01/18/2023, 6:46 AM

ok, any workaround you may want to suggest ? I was thinking of doing an intersection of the two results, but I think it might impact the turnaround time of the hits.
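For illustration only, the client-side intersection Krish describes might look like the sketch below. It assumes each hit carries its document under a `document` key with an `id`, and it keeps the vector search ordering. It needs two searches plus client-side work, and a document survives only if it appears in both result lists, so this is a sketch of the idea and not a recommendation.

```python
def intersect_hits(keyword_hits, vector_hits):
    """Keep vector hits whose document id also appears in the keyword hits."""
    keyword_ids = {hit["document"]["id"] for hit in keyword_hits}
    return [hit for hit in vector_hits if hit["document"]["id"] in keyword_ids]


# Hypothetical hits from a keyword search and a vector search.
keyword_hits = [{"document": {"id": "1"}}, {"document": {"id": "3"}}]
vector_hits = [
    {"document": {"id": "3"}, "vector_distance": 0.10},
    {"document": {"id": "2"}, "vector_distance": 0.20},
    {"document": {"id": "1"}, "vector_distance": 0.30},
]
common = intersect_hits(keyword_hits, vector_hits)
print([hit["document"]["id"] for hit in common])
```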

Kishore Nallan

01/18/2023, 7:11 AM

There are no easy workarounds here, unfortunately.

Krish

01/18/2023, 7:48 AM

OK. Hey, I've got a stupid idea, but it will work. However, I cannot gauge the performance impact and/or turnaround time of the queries, so I am hoping you could shed some light. What if I create a temp collection on the fly, import the hits of the keyword search into it, and do a search using a vector query on it?

Kishore Nallan

01/18/2023, 7:49 AM

That will be very slow

Krish

01/18/2023, 7:49 AM

Hmm.... Would an intersection of the two be faster?

Kishore Nallan

01/18/2023, 3:03 PM

No, the only way to solve this is if we implemented this feature.

👍 1

Krish

01/23/2023, 2:48 PM

@Kishore Nallan is there a way I could send back only the results greater than a given value of the hits' 'vector_distance' parameter? Btw, I am using 0.24.rc56 with a vector-based query. Something like filter_by=vector_distance:>0.25 ?

Kishore Nallan

01/23/2023, 2:50 PM

Nope, that's not possible, but these distances don't carry any absolute semantic meaning. They are only useful as relative values.

Krish

01/23/2023, 2:50 PM

but the results are in descending order of the vector_distance

Kishore Nallan

01/23/2023, 2:51 PM

Yes, only as relative ordering it matters.

Krish

01/23/2023, 2:51 PM

so, what's happening is it's sending back all the documents, but I need the top x%

Kishore Nallan

01/23/2023, 2:52 PM

All documents? It only sends back `k` documents, where `k` is either specified in your vector search query or, if not sent, calculated as `page * per_page`. With vector search you typically can't do pagination, so you have to just fetch 500 - 1000 results and paginate on the client side if needed.
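The rule described here, together with client-side pagination, can be sketched as below. The helper names are made up, and pages are assumed to be numbered from 1.

```python
def effective_k(page, per_page, k=None):
    """Number of results returned: the explicit k if given, else page * per_page."""
    return k if k is not None else page * per_page


def paginate(hits, page, per_page):
    """Slice one 1-indexed page out of a hit list that was fetched in a single request."""
    start = (page - 1) * per_page
    return hits[start:start + per_page]


# Placeholder hits standing in for one large vector search result.
all_hits = list(range(25))
print(effective_k(page=2, per_page=10))
print(paginate(all_hits, page=3, per_page=10))
```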

Krish

01/23/2023, 2:52 PM

I am guessing it's also sending them back based on the cosine similarity, right (most likely in descending order)?

Kishore Nallan

01/23/2023, 2:52 PM

Cosine similarity, yes.

Krish

01/23/2023, 2:54 PM

ok! My per_page is 200 and it's sending me all 200 (which makes sense), but only the top 75-100 are actually useful; the rest are really low on similarity (I am guessing based on the results)

Krish

01/23/2023, 2:55 PM

So, how could I retrieve the top ones based on some mathematical assessment like > x%?

Krish

01/23/2023, 2:55 PM

doing it on the client would be painful, you can imagine

Kishore Nallan

01/23/2023, 2:55 PM

Similarity can't be expressed as a percentage. It's a distance, and the absolute value has no meaning.

Krish

01/23/2023, 2:56 PM

any recommendations?

Kishore Nallan

01/23/2023, 2:57 PM

But cosine distance is scaled between 0 and 1. In practice, though, I've not found a threshold to be really useful.

Krish

01/23/2023, 2:57 PM

I am having a hard time too, but from the results it appears that anything higher than 0.25 is decent

Kishore Nallan

01/23/2023, 3:04 PM

Can you create an issue on Github for this? We will consider as feature request.

Krish

01/23/2023, 3:04 PM

Awesome I will, but any hacks for the time being ? 🙂

Kishore Nallan

01/23/2023, 3:11 PM

Nope, only client side filtering.
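Client-side filtering on the `vector_distance` value of each hit could be sketched as follows. The 0.25 threshold comes from Krish's observation earlier in the thread. Which side of the threshold holds the useful results depends on how the distance is reported, so the direction is left as a parameter. `top_fraction` assumes the hits already arrive best-first. Both helper names are hypothetical.

```python
import math


def filter_by_distance(hits, threshold, keep="at_most"):
    """Keep hits whose vector_distance is at most (or at least) the threshold."""
    if keep not in ("at_most", "at_least"):
        raise ValueError("keep must be 'at_most' or 'at_least'")
    if keep == "at_most":
        return [hit for hit in hits if hit["vector_distance"] <= threshold]
    return [hit for hit in hits if hit["vector_distance"] >= threshold]


def top_fraction(hits, fraction):
    """Keep the first fraction of hits, assuming they are ordered best-first."""
    count = math.ceil(len(hits) * fraction)
    return hits[:count]


# Hypothetical hits with made-up distances.
hits = [
    {"document": {"id": "1"}, "vector_distance": 0.10},
    {"document": {"id": "2"}, "vector_distance": 0.20},
    {"document": {"id": "3"}, "vector_distance": 0.30},
    {"document": {"id": "4"}, "vector_distance": 0.40},
]
print([hit["document"]["id"] for hit in filter_by_distance(hits, 0.25)])
print([hit["document"]["id"] for hit in top_fraction(hits, 0.5)])
```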

Krish

01/23/2023, 3:13 PM

Hmmm... Ok. But that'll be painful at the client end.
