# community-help
k
Hey there, I am looking at integrating semantic search functionality with Typesense for the MVP I am working on. I have it coded in Python and it works in a standalone environment. Since Typesense is written in C++, I'm not sure how to integrate my existing Python code with it to make use of its in-memory indexing. Could you share some thoughts and/or pointers?
k
Use the Python client to integrate with Typesense server.
k
OK, let me try to explain what I am trying to do here. I am trying to implement semantic search using cosine similarity, which requires me to take the search query, compute its cosine similarity with all the docs, and return the highest-ranking ones. Currently I have a small Python script that does this against a JSON file for a search query. I already have a Typesense instance running. My client is in Node.js, and soon I will be updating the existing docs in the Typesense server with the embeddings. What I am not sure of is this: on a search request, how do I make my client execute the Python script on the docs and return the data from within the Typesense server? I could still do this by grabbing all the docs from Typesense and then running the cosine similarity on each to find the best, but it would be inefficient. I hope that clarifies my requirement/request.
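For reference, the brute-force approach described in this message (score every document against the query, keep the best) can be sketched roughly as below. This is only an illustration in plain Python, not the asker's actual script; the function names and the `embedding` field name are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, docs, limit=10, field="embedding"):
    """Score every doc against the query and return the best `limit`.

    `field` is a hypothetical name for the key holding each doc's vector.
    Returns a list of (similarity, doc) pairs, most similar first.
    """
    scored = [(cosine_similarity(query_vec, d[field]), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]
```

Every query touches every document, which is the inefficiency mentioned above once the collection grows.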
k
You can't run a Python script on the docs within Typesense directly. Typesense can already do nearest-neighbour cosine similarity search on embeddings in the 0.24 RC builds.
k
I am on 0.23.1, so are you recommending an upgrade? And if I want to use this functionality, do I have to call out the fields I need the similarity calculated for using the 'query by' option?
k
Thank you. Questions: 1. Does the field num_dim have any bearing? In my case I cannot say for sure what this number would be for each document. 2. What's the default sorting criterion in the schema? Does it tell Typesense to sort the records / documents on the specified field, so that whenever data is returned it is sorted not only on nearest cosine similarity but also on this field? 3. In the search parameters, is it necessary for the client to send the query vector?
k
1. For cosine distance, the number of dimensions must be known upfront, so num_dim must be the same for all the docs. 2. Currently it's only possible to sort the results on distance from the query vector. 3. You can send either a query vector or the ID of a document whose field value should be used as the reference query vector.
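To illustrate point 1: a collection schema declares the vector field once, with a fixed num_dim, so every document's vector has to have that length. The sketch below assumes a `float[]` field carrying a `num_dim` property; the collection name, field names, and helper functions are hypothetical.

```python
def build_schema(collection, vector_field, num_dim):
    """Build a schema dict with one fixed-size vector field.

    The "title" field and all names passed in are illustrative only.
    """
    return {
        "name": collection,
        "fields": [
            {"name": "title", "type": "string"},
            {"name": vector_field, "type": "float[]", "num_dim": num_dim},
        ],
    }


def mismatched_ids(docs, vector_field, num_dim):
    """Return the ids of docs whose vector length differs from num_dim."""
    return [
        d.get("id")
        for d in docs
        if len(d.get(vector_field, [])) != num_dim
    ]
```

Running a check like `mismatched_ids` before importing is one way to catch documents whose embeddings came from a different model or were truncated.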
k
OK. Let me play with the implementation a bit. Thank you Kishore.
👍 1
Do you all have the 0.24 RC Docker image available yet?
k
k
Which one do I go for, rcn56?
k
Yes
👍 1
k
Any idea why I have been getting a bad request while searching? Below is my curl command:
curl -g -H "X-TYPESENSE-API-KEY: <my key>" "http://localhost:8108/collections/semantic/documents/search?q=*&vector_query=vector:([-0.01622316800057888,0.0011516984086483717,-0.0028857849538326263,-0.011529190465807915,-0.0017157779075205326,-0.0015300762606784701,-0.013300766237080097,-0.03791450709104538,-0.011020036414265633,-0.01646030880510807,0.011229277588427067,-0.012414977885782719,-0.009652994573116302,-0.0007240617414936423,.....,0.005042713135480881,0.01222666073590517,-0.0012711402960121632,0.024090636521577835,-0.02943326346576214], k:100)"
the vector in the request is truncated
k
Use multi search endpoint
could you help with a sample query using the multi search endpoint please?
k
See the shell example in the link you have posted above. What problem are you facing?
k
The curl command pasted above always fails with the error "Bad request". I took a look at the shell example in the doc, but I'm not sure how to use it for a vector query. For example, would the query_by field refer to the vector field?
k
Yes that won't work because the GET method imposes a length restriction, which is why I pointed to multi search
I will have to update the vector search readme to account for this. Use multi_search
k
while you update, if you could send me some sample pointers for me to proceed with my experiment, it'd help
k
Vector query using multi search:
curl 'http://localhost:8108/multi_search?collection=docs' -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
--data-raw '{"searches":[{"q":"*", "vector_query": "vec:([0.96826,0.94,0.39557,0.306488])" }]}'
The 400 bad request should have returned an error like this:
Query string exceeds max allowed length of 4000. Use the /multi_search end-point for larger payloads.
Did it not?
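For anyone doing the same from Python rather than curl, the multi_search request can be sketched with only the standard library. The helper names are made up for illustration, the collection is passed per search entry rather than as a query parameter, and the request is only built here, not sent.

```python
import json
import urllib.request


def vector_query(field, vector, k=None):
    """Format a vector_query string such as 'vec:([0.1,0.2], k:100)'."""
    values = ",".join(repr(float(v)) for v in vector)
    query = f"{field}:([{values}]"
    if k is not None:
        query += f", k:{k}"
    return query + ")"


def build_multi_search_body(collection, field, vector, k=None):
    """Build the JSON body for a single vector search via multi_search."""
    return {
        "searches": [
            {
                "collection": collection,
                "q": "*",
                "vector_query": vector_query(field, vector, k),
            }
        ]
    }


def build_request(base_url, api_key, body):
    """Wrap the body in a POST request; the vector travels in the body,
    so the query-string length limit on GET does not apply."""
    return urllib.request.Request(
        url=f"{base_url}/multi_search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-TYPESENSE-API-KEY": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result of `build_request(...)` to `urllib.request.urlopen` would send it; that step is left out so the sketch does not depend on a running server.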
k
No, it did not... it only returned "bad request"
k
Using Python client or curl?
k
curl.
k
Can you try doing curl with the "-v" parameter?
k
sure, same query right?
k
Yes, what you are seeing on the screenshot is the http error code reason, not the actual returned body of the response.
k
this is with -v
k
Hmm ok. Can you post the full curl request? You can anonymize the host, collection name / other identifiable params.
k
sure
Attached, the vector field is enormously lengthy. Hence, pasted it in the file.
k
Ok, let me try running locally
👍 1
I think pasting via the terminal is maybe messing something up. When I paste the URL into a browser, I get a proper response. Can you try that?
In any case, using multi_search will get around this issue. I will update the example in readme.
k
The multisearch query worked...
👍 1
i'll try in the browser and revert
Yes, in the browser it reported the appropriate error
{
  "message": "Query string exceeds max allowed length of 4000. Use the /multi_search end-point for larger payloads."
}
k
Ok, yeah then it must be some terminal issue.
k
Cool. Thanks for the help. One other thing, since you are around. The vector arrays for my records are massive (the length is around 1536 :)). When the search returns the hits, is there a way it could return only a few fields from the records? (I am looking at omitting the vector field because it's annoying.)
k
Yes, check out the include_fields and exclude_fields search params. They control which fields are returned.
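As a small sketch of how that could look on the client, the helper below adds exclude_fields to an existing search entry so that the large vector field is left out of the hits. The helper and the field name `vector` are hypothetical; the comma-separated format is an assumption carried over from how other list-style search params are written.

```python
def without_fields(search, *field_names):
    """Return a copy of a search dict that asks for fields to be omitted."""
    updated = dict(search)  # leave the caller's dict untouched
    updated["exclude_fields"] = ",".join(field_names)
    return updated


search = {
    "collection": "semantic",
    "q": "*",
    "vector_query": "vector:([0.1,0.2,0.3,0.4], k:100)",
}
trimmed = without_fields(search, "vector")
```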
k
awesome!
thank you!
👍 1
@Kishore Nallan - got a question around this implementation. Is there a way I could do a multi-search query on a subset of the collection ?
Basically, what I am looking at doing is this: when the user first sends the query, it does the normal search (which in my case would be a keyword search). On this result I would then like to do the multi-search using vector_query. Is this possible?
k
This is not possible yet, but we plan to support this in future. For now vector search can only work on a regular filter_by subset, and not on the results of a keyword search.
k
ok, any workaround you may want to suggest ? I was thinking of doing an intersection of the two results, but I think it might impact the turnaround time of the hits.
k
There are no easy work arounds here unfortunately.
k
OK. Hey, I've got a stupid idea, but it will work. However, I cannot gauge the performance impact and/or turnaround time of the queries, and I am hoping you could shed some light. What if I create a temp collection on the fly, import the hits of the keyword search, and run a vector query search on it?
k
That will be very slow
k
Hmm.... Would an intersection of the two be faster?
k
No, the only way to solve this is if we implemented this feature.
👍 1
k
@Kishore Nallan is there a way I could send back only results whose 'vector_distance' is greater than a given value? Btw, I am on 0.24.rc56, using vector-based queries. Something like filter_by=vector_distance:>0.25 ?
k
Nope that's not possible but these distances don't carry any semantic absolute meaning. They are only useful as relative values.
k
but the results are in descending order of the vector_distance
k
Yes, only as relative ordering it matters.
k
so, what's happening is it's sending back all the documents, but I need the top x%
k
All documents? It only sends back k documents, where k is either specified in your vector search query or, if not sent, calculated as page * per_page. With vector search you typically can't do pagination, so you have to just fetch 500 - 1000 results and paginate on the client side if needed.
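Paginating on the client side, as suggested here, can be as simple as slicing the list of hits that came back from one larger fetch. A minimal sketch, with a hypothetical helper name and 1-based page numbers:

```python
def paginate(hits, page, per_page):
    """Return one page of an already-fetched list of hits (page is 1-based)."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    return hits[start:start + per_page]
```

The hits keep the order the server returned them in, so each page stays ranked by distance from the query vector.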
k
I am guessing it's also sending them back based on the cosine similarity, right (most likely in desc order)?
k
Cosine similarity, yes.
k
ok! My per_page is 200 and it's sending me all 200 (which makes sense), but only the top 75/100 are actually useful; the rest are really low on similarity (I am guessing, based on the results)
So, how could I retrieve the top ones based on some mathematical assessment like > x%?
doing it on the client would be painful, you can imagine
k
A similarity measurement as a percentage isn't possible. It's a distance, and the absolute value has no meaning.
k
any recommendations?
k
But cosine distance is scaled between 0 and 1. In practice, though, I've not found a threshold to be really useful.
k
I am having a hard time too, but from the results it appears that anything higher than 0.25 is decent
k
Can you create an issue on GitHub for this? We will consider it as a feature request.
k
Awesome I will, but any hacks for the time being ? 🙂
k
Nope, only client side filtering.
k
Hmmm... Ok. But that'll be painful at the client end.
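For what it's worth, the client-side filtering discussed above can stay fairly small. The sketch below assumes each hit is a dict carrying a numeric `vector_distance` value, as mentioned earlier in the thread, and that a smaller distance means a closer match; if your scores run the other way, flip the comparison and the sort order. The function names are hypothetical.

```python
import math


def within_distance(hits, max_distance):
    """Keep only hits whose vector_distance is at most max_distance."""
    return [h for h in hits if h["vector_distance"] <= max_distance]


def closest_fraction(hits, fraction):
    """Keep the closest `fraction` of hits (e.g. 0.25 for the top 25%)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(hits, key=lambda h: h["vector_distance"])
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]
```

Since the distances are only meaningful relative to each other, the fraction-based cut may hold up better across queries than a fixed threshold.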