#community-help

Utilizing Vector Search and Word Embeddings for Comprehensive Search in Typesense

TLDR: Bill sought clarification on using vector search with multiple word embeddings in Typesense, and on using them instead of OpenAI's embeddings. Kishore Nallan and Jason informed him that the development version 0.25 supports open source embedding models. They also resolved Bill's concerns regarding search performance, language support, and limitations in the search parameters.

Jul 11, 2023 (5 months ago)
Bill
07:26 PM
Yes I'll create it
07:26
Bill
07:26 PM
Is the only way to limit the results by using vector_query?
Jason
07:26 PM
When you use auto-embedding - correct
Bill
07:27 PM
Yes with auto embedding
Jason
07:27 PM
per_page and page might work, but it’s not going to go off of vector distance
Bill
07:27 PM
If i don't limit the results, I get all docs sorted by vec distance
Jason
07:28 PM
There is also a k parameter in vector_query, but that runs into the same issue - we need to add support for both
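For reference, this is roughly how the k parameter is written inside vector_query when combined with auto-embedding - a sketch reusing the products collection and embedding field from the request bodies later in this thread (the empty [] means the vector is generated from q):

{
  "searches": [
    {
      "q": "internet",
      "collection": "products",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "vector_query": "embedding:([], k: 100)",
      "per_page": 250
    }
  ]
}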
Bill
07:28 PM
Yes I read about the k parameter
07:29
Bill
07:29 PM
Another quick way to fix this issue is to remove items with a vector distance above e.g. 0.7 on the frontend
Jason
07:30 PM
Yeah that would work as a workaround
07:30
Jason
07:30 PM
But you bring up a good use-case, we definitely want to add support for k and distance_threshold
Bill
07:31 PM
Sure it will be very useful for auto embedding
07:37
Bill
07:37 PM
Where should I create the issue?
Bill
07:50 PM
08:45
Bill
08:45 PM
What's the best way to search for multiple terms? For example, I want to find recipes with: apples, oranges, coffee. I tried with "q": "apples, oranges, coffee" but the results are not good. (The vector distance in the coffee recipes is above 0.7.)
Jason
09:02 PM
The only way to do precise searches like that would be to use filter_by instead of / or in combination with vector search
Bill
09:47 PM
If I do a search with e.g. "q": "internet" I get all docs that are similar to this meaning. But if I do a search with q=* and filter_by="title:= internet OR title:= coffee OR title:= cars", I get only the results that have these terms in the title. It's like it doesn't use the model and uses the normal search instead
09:48
Bill
09:48 PM
1st code:
{
  "searches": [
    {
      "q": "internet",
      "collection": "products",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "prefix": false,
      "per_page": 250
    }
  ]
}
Jason
09:49 PM
filter_by only does keyword-based filtering. Only the q parameter is used for vector searches
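In other words, the two can be combined in a single search: q drives the vector search, while filter_by narrows the candidate set with exact keyword matches. A sketch that merges the two approaches Bill describes above:

{
  "searches": [
    {
      "q": "internet",
      "collection": "products",
      "query_by": "embedding",
      "filter_by": "title:= internet || title:= coffee || title:= cars",
      "exclude_fields": "embedding",
      "per_page": 250
    }
  ]
}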
Bill
09:50 PM
2nd code:
{
  "searches": [
    {
      "q": "*",
      "collection": "products",
      "filter_by": "title:= internet || title:= coffee || title:= cars",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "prefix": false,
      "per_page": 250
    }
  ]
}
09:50
Bill
09:50 PM
So it doesn't work in my case
09:50
Bill
09:50 PM
Is there any other way to find similarity in multiple terms?
09:50
Bill
09:50 PM
using q input
Jason
09:51 PM
“Similarity” is fully defined by the model
Bill
09:51 PM
I mean docs near the terms
Jason
09:52 PM
You could try breaking it out into 3 searches within a multi-search, one search per term (using the q parameter) and then aggregate the results client-side
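A minimal sketch of that suggestion - one search per term inside a single multi_search request, with the results merged and deduplicated client-side:

{
  "searches": [
    {
      "q": "apples",
      "collection": "products",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "per_page": 250
    },
    {
      "q": "oranges",
      "collection": "products",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "per_page": 250
    },
    {
      "q": "coffee",
      "collection": "products",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "per_page": 250
    }
  ]
}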
Bill
09:52 PM
If I do a search with q="coffee, cars" I get the results, but the vector distance of the cars docs is wrong
09:53
Bill
09:53 PM
It's higher than it should be
Jason
09:54 PM
Typesense does not control that distance metric in the sense that the model is the one that takes search terms and projects them into vector space
09:54
Jason
09:54 PM
So if it returns unexpected results, then the model is probably not a good fit for what you’re trying to do
Bill
09:56 PM
Ok, I'll break it into 3 searches with multi_search
10:53
Bill
10:53 PM
I’m trying to create a suggestion system. For example, the user has searched for the terms Harry Potter, Fast n Furious, Wolf of Wall Street. I store these values, and on the next app launch the user should get suggestions (movies about magicians, about cars, or about economics) based on these 3 previous searches. What would be the best way to do this with Typesense?
Jason
10:55 PM
To do user-level personalization based on search history, you would have to use a recommender model - send it data as users interact with the site. Then, when a user lands on the site again, embed the user profile to get the vectors, then do a nearest-neighbor search in Typesense to get recommended items to show them
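The nearest-neighbor step can be done by passing the externally computed user-profile vector directly in vector_query instead of a text query. A sketch, shown with a toy 4-dimensional vector for brevity - a real vector must match the embedding field's dimensionality:

{
  "searches": [
    {
      "q": "*",
      "collection": "products",
      "exclude_fields": "embedding",
      "vector_query": "embedding:([0.12, -0.07, 0.33, 0.18], k: 10)",
      "per_page": 10
    }
  ]
}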
Bill
11:38 PM
Yes, that’s what I’m trying to do. I tried with multiple terms in the q input parameter as I wrote to you, but no luck. So the only way is by doing 3-5 multi searches with the stored values (e.g. Harry Potter etc.). Can you do a nearest-neighbor search with multiple terms?
Jul 12, 2023 (5 months ago)
Jason
01:57 AM
Yeah, you can separate them by spaces… but the entire sentence will be embedded together and not one-by-one
Bill
10:06 AM
Yes, I tested it with spaces but the vector distance of the docs is wrong. It would be very useful to have multiple terms in q separated by a special character (e.g. ||)
Kishore Nallan
03:42 PM
Bill

I've a fix for the distance_threshold param in typesense/typesense:0.25.0.rc48.

You can now pass it like this:

'vector_query': 'vec:([], distance_threshold: 0.25)'
03:43
Kishore Nallan
03:43 PM
Please try this branch locally first or on a staging environment.
Bill
07:26 PM
Yes it works now using 'vector_query': 'embedding:([], distance_threshold: 0.25)'. Is it production ready?
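For completeness, this is how that vector_query slots into the multi_search body used earlier in the thread:

{
  "searches": [
    {
      "q": "coffee",
      "collection": "products",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "vector_query": "embedding:([], distance_threshold: 0.25)",
      "per_page": 250
    }
  ]
}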
Jul 13, 2023 (4 months ago)
Kishore Nallan
01:23 AM
We are mostly on code freeze on this branch now, just ironing out the last few blockers or bugs like this. So you can use it.
Bill
09:18 AM
Will there be breaking changes to these parts (auto embedding, vector query, etc.) in the stable version of 0.25.0?
Kishore Nallan
09:20 AM
No, the API is frozen. Barring these little usability quirks we find.
Bill
10:16 AM
Is it better to use auto embedding on multiple inputs, e.g. title and description, or only on one (title)? I’m asking because if the description has no info about the title, the distance will be larger.
Kishore Nallan
10:55 AM
We currently support querying against only 1 embedding so you have to try and represent the product in a single piece of text which can be embedded.
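With auto embedding, one way to get that single piece of text is to list several source fields under embed.from, which are concatenated before being embedded. A sketch, assuming title and description fields and using the ts/ model-naming convention from Bill's schema:

{
  "name": "products",
  "fields": [
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "description",
      "type": "string"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "title",
          "description"
        ],
        "model_config": {
          "model_name": "ts/paraphrase-multilingual-mpnet-base-v2"
        }
      }
    }
  ]
}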
Bill
10:59 AM
I noticed another model issue. I use the model paraphrase-multilingual-mpnet-base-v2, and in languages like Greek, if I search with q: "Προγραμματιστής" I get the doc with title "Προγραμματιστής" but the vector distance is 3.4724921e-733. If I search with q: "προγραμματιστής" the distance is correct (0.002341...)
Kishore Nallan
11:01 AM
The models are basically a black box. Unless there is a systemic issue which maps all words wrongly, it's likely a quirk of the model. Non-English models are probably not that great.
Bill
11:07 AM
So in this case the best solution is to lowercase the search term?
11:09
Bill
11:09 AM
Is this model supported by Typesense -> https://huggingface.co/intfloat/multilingual-e5-large/tree/main? In the onnx folder there are these files: config.json, model.onnx, model.onnx_data, sentencepiece.bpe.model
Kishore Nallan
11:12 AM
It's not on our Hugging Face remote yet. But if you can convert the model to ONNX and put the files on disk locally, you should be able to use it.
Bill
11:12 AM
It is already converted to ONNX format. But it has model.onnx, model.onnx_data
01:48
Bill
01:48 PM
Should I edit the config file? It doesn't match the config files in the typesense folder
Kishore Nallan
01:50 PM
I'll have to look. It doesn't seem to conform to the conventions of existing models.
Bill
01:50 PM
The config in the typesense folder is:
{
  "model_md5": "728d3db98e1b7a691a731644867382c5",
  "vocab_file_name": "sentencepiece.bpe.model",
  "vocab_md5": "bf25eb5120ad92ef5c7d8596b5dc4046",
  "model_type": "xlm_roberta"
}
01:50
Bill
01:50 PM
And in this model's config file it is:
{
  "_name_or_path": "intfloat/multilingual-e5-large",
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
01:51
Bill
01:51 PM
I tried to convert it to:
{
  "vocab_file_name": "sentencepiece.bpe.model",
  "model_type": "xlm_roberta"
}
But I get this error -> "message": "Failed to download model file"
Kishore Nallan
01:52 PM
I'll get back to you tomorrow after investigation
Bill
01:52 PM
With schema:
{
  "name": "productsNew",
  "fields": [
    {
      "name": "product_name",
      "type": "string"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "product_name"
        ],
        "model_config": {
          "model_name": "ts/multilingual-e5-large"
        }
      }
    }
  ]
}
01:52
Bill
01:52 PM
Ok
Jul 14, 2023 (4 months ago)
Kishore Nallan
05:48 AM
The onnx file in this model is set up differently from what we have for other models, so we need to spend some time to see how to integrate it. We're currently tied up with the last stretch of tasks for the 0.25 release. We will only be able to look into this after that, perhaps in 10-12 days.
Bill
09:54 AM
Ok, I’ve read that the most precise free multilingual NLP model right now is the paraphrase model that you have already uploaded to the Typesense model folder, so for now it’s ok. The only issue it has is with capitalized letters, but I correct that on the client side
11:40
Bill
11:40 AM
I created a new collection and I get this error on all nodes: "Not Ready or Lagging", and at random times 502 Bad Gateway
Kishore Nallan
11:46 AM
Are you indexing?
11:47
Kishore Nallan
11:47 AM
If writes exceed the write lag threshold, the server returns a 502
Bill
11:47 AM
Yes, I uploaded some docs. I have deployed it on a 2 CPU / 4 GB RAM machine, and when I created the collection with the embedding model, after 10s I got a 503
11:47
Bill
11:47 AM
It's like the nodes are not synced
Kishore Nallan
11:48 AM
Cluster or single node? Which version of Typesense?
Bill
11:48 AM
Using /debug I get random state: 4 and state: 0
11:48
Bill
11:48 AM
cluster
11:48
Bill
11:48 AM
v0.25.0.rc48
Kishore Nallan
11:49 AM
Difficult to say what's happening without looking at logs
Bill
11:49 AM
Is there any way to reset the nodes, or only by doing a fresh install?
11:51
Bill
11:51 AM
I tried to have only 1 node but then the state is 4 instead of 1
12:01
Bill
12:01 PM
I also tried to Re-elect Leader but the state is still 4
Kishore Nallan
12:05 PM
There's a reset peers API. You can try that. It will force the cluster to peer again
Bill
12:05 PM
What's the API endpoint?
Kishore Nallan
12:31 PM
POST /operations/reset_peers
Bill
12:43 PM
It's still state: 4
12:44
Bill
12:44 PM
How can I do a fresh install?
12:45
Bill
12:45 PM
Delete all typesense's log folders?
Kishore Nallan
12:48 PM
You have to stop all nodes, delete the data directory, and start them back up.
Bill
12:48 PM
data dir and logs?
12:51
Bill
12:51 PM
I deleted var/logs/typesense, var/lib/typesense and the state is still 4
12:51
Bill
12:51 PM
*and sudo apt remove typesense-server
Kishore Nallan
12:53 PM
Hard to help further without looking at the logs
Bill
01:24 PM
Ok, fixed it. The issue was with the RAM during indexing. The model needs at least 8 GB to create the collection; then the machine can be resized down to 4 GB
Kishore Nallan
01:48 PM
Ok glad to hear.
01:48
Kishore Nallan
01:48 PM
Which model is this?
Bill
02:01 PM
I use the paraphrase
02:33
Bill
02:33 PM
How can I use "sort_by": "vector_distance:asc, timestamp:desc"? I get this error: "Could not find a field named vector_distance in the schema for sorting"
02:34
Bill
02:34 PM
If I have sort_by: timestamp:desc, I get results sorted only by timestamp and not by vector_distance
Kishore Nallan
02:37 PM
Try _vector_distance:desc
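So Bill's sort would look roughly like this - a sketch; note the leading underscore on the field name, and the direction on _vector_distance may need to be flipped depending on whether nearest or farthest matches should come first:

{
  "searches": [
    {
      "q": "coffee",
      "collection": "products",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "sort_by": "_vector_distance:asc, timestamp:desc",
      "per_page": 250
    }
  ]
}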


Jul 19, 2023 (4 months ago)
Bill
08:46 PM
Kishore Nallan I noticed another issue with build 0.25.0.rc53. When the server reboots, the Typesense server auto-starts, but the collection with embedding fields is not loaded automatically. The collection is loaded into memory only when there is a new request (multi_search with query_by embedding). As a result, the first request always returns error 500.
{
  "results": [
    {
      "code": 500,
      "error": "Request timed out."
    }
  ]
}
08:52
Bill
08:52 PM
Also, I tested it on 2 different droplet sizes: 1) 4 CPU / 8 GB RAM, 2) 2 CPU / 4 GB RAM. In the 1st configuration the memory consumption is at 1.1 GB; in the 2nd configuration the memory is at 2.2 GB.
Jul 20, 2023 (4 months ago)
Bill
02:22 PM
Kishore Nallan Jason any idea?
Kishore Nallan
02:33 PM
Are you saying that on restart, once the first search request hits the collection, that's when the collection is actually loaded?
Bill
02:43 PM
Yes, that’s happening only if the collection has an embedding field. On restart, the instance loads only the Typesense server with the other collections, and the RAM usage is about 60-70 MB. When I search with multi_search query_by embedding, I get error 500 on the first request, and then, once the collection has been loaded, the RAM is at 1.6 GB. So it seems like it loads the model only after the 1st query
Kishore Nallan
02:46 PM
Super odd. Will take a look and get back to you. Late here now so, tomorrow.
Bill
02:46 PM
Ok
