# community-help
b
Hello, can I use vector searching for multiple word embeddings? E.g. I want to find all near results to (king, castle, river). Is this possible, or can I use only 1 word embedding?
k
You will send multiple words into the model and it will return a single embedding that encapsulates the meaning.
b
What's the best way to use word embeddings instead of OpenAI's? So in order to use it, I have to vectorize the terms (word embeddings) and then use that mapping to search for near results. Am I correct?
k
In Typesense 0.25 RC we automatically support good open source embedding models.
You have to just define an embedding field and configure it to be embedded based on other text fields and everything just works.
🙌 1
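For reference, a minimal sketch of what such an embedding field definition looks like (the model name here is only an illustration, following the ts/ prefix convention for built-in models shown later in this thread):
Copy code
{
  "name": "embedding",
  "type": "float[]",
  "embed": {
    "from": ["title"],
    "model_config": {
      "model_name": "ts/e5-small"
    }
  }
}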
b
@Kishore Nallan that’s perfect! When will this version be available?
j
We’re on feature freeze for 0.25 and plan to release it in the next few weeks, pending any additional issues you find. But you can already use this feature with the latest RC build:
0.25.0.rc47
b
Is this version stable for production?
j
We’ve fixed all known issues with built-in embedding models… The remote embedding models (OpenAI specifically) have issues we’re working on
b
I need only the built-in embedding models. Is there any other issue apart from OpenAI?
In addition, will GPU support be added in v0.25 or in the next one?
j
Is there any other issue apart from OpenAI?
None that have been reported. The build has been otherwise stable with other users using it
In addition, will GPU support be added in v0.25 or in the next one?
It’s already in 0.25
b
So gpu is supported right now? In 0.25.0.rc47
j
Correct
b
Perfect! Is the DEB package available?
j
There’s some additional setup required in terms of runtime dependencies for GPU support to work. Some caveats:
• Only Nvidia GPUs are supported
• You want to install CUDA and cuDNN following Nvidia’s instructions
• You also want to install ONNX Runtime (Linux - C/C++ row here and place that .so file in the same directory as the typesense binary)
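A rough sketch of that setup on a Linux box (the archive name, version and binary location are assumptions; follow Nvidia's and ONNX Runtime's own instructions for your system):
Copy code
# 1. Install CUDA and cuDNN following Nvidia's instructions for your distro.
# 2. Download the ONNX Runtime GPU build (the Linux C/C++ row) and copy its shared
#    library next to the typesense-server binary (assumed here to live in /usr/bin):
tar -xzf onnxruntime-linux-x64-gpu-1.14.1.tgz
sudo cp onnxruntime-linux-x64-gpu-1.14.1/lib/libonnxruntime*.so* /usr/bin/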
b
I need it for simple search, for example to give 1-3 terms and find near results. So in this case is GPU required?
j
No, you can just use CPU
GPU is useful when you have say more than 100K records and you want to speed up the embedding generation process during indexing for each of the records, when using the built-in models
b
My records will be max 10-15K. Is search performance affected by GPU or only by CPU?
j
GPU helps both during hybrid search (where given a search query, Typesense will generate embeddings for the search query and then do a nearest neighbor search) and also during indexing documents and creating embeddings for them using the built-in models.
But for 15K records, CPU should be sufficient in terms of speed. GPU might be expensive + unnecessary compared to the perf improvement it will give you
b
@Jason Bosco I deployed v0.25.0.rc46 on a test server but it doesn't work very well. I used both models (S-BERT & E5) but the results are wrong in other languages. If the document (product_name) is in English and the search term is in English too, it works great.
In other languages, such as Greek, the results are wrong. It's like it doesn't process the term.
j
Ah yeah, both S-BERT and E-5 are trained on English datasets
So they will not work in other languages
You want to look for a language-specific model…
Looks like there’s a GreekBERT model here: https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1 Could you give it a shot outside of Typesense and let me know how accurate it is?
b
Can I use custom models in typesense?
@Jason Bosco I tried the GreekBERT Model and the results are good. How can I use it in typesense?
j
We have a way to load custom models in Typesense… Need to document this. Let me get back to you in about 24 hours.
b
Ok 👍
b
Hi @Jason Bosco I downloaded the xlm_roberta model (https://huggingface.co/xlm-roberta-base) because it has more languages than just Greek. I downloaded the config.json file and the .onnx according to the docs that you sent me. In xlm_roberta there is no vocab file; is it required, or is it only for BERT models?
I tried to create a collection but I get this message -> "Invalid config file". I have uploaded the following files to var/lib/typesense/models/xlm_roberta: config.json, model.onnx, sentencepiece.bpe.model, tokenizer.json from the link I attached
k
Have you tried this multi lingual model? https://huggingface.co/typesense/models/tree/main/paraphrase-multilingual-mpnet-base-v2 It should work for Greek.
It's built upon xlm_roberta
b
Okay i'll test it
Is DistilBERT supported?
k
No we don't support that. Not all BERT-like models work automatically well for semantic search. They might work for classification or summarization.
b
Okay I'll deploy the model you sent me and i'll inform you
What's the issue with this model -> https://huggingface.co/xlm-roberta-base ?
k
To use paraphrase-multilingual-mpnet-base-v2, you just need to do this:
Copy code
curl -k "<http://localhost:8108/collections>" -X POST -H "Content-Type: application/json" \                                                                                                                  130 ↵
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
  "name": "titles",
  "fields": [
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "points",
      "type": "int32"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "title"
        ],
        "model_config": {
          "model_name": "ts/paraphrase-multilingual-mpnet-base-v2"
        }
      }
    }
  ]
}'
👍 1
xlm-roberta-base is a plain masked language model. These models must then be fine-tuned specifically for a task, like semantic search. They don't work well without that fine-tuning.
👍 1
That's based on distilbert
b
What model would be best in order to support words in multiple languages? For example, term: accountant -> I should get results for λογιστής (Greek), contador (Spanish) etc.
k
I'm not sure if the multi-lingual models project the words from different languages onto the same vector space.
Try with paraphrase-multilingual-mpnet-base-v2 --> it might do what you want.
b
The meanings of these 3 words, which are in different languages, should be in the same space
Yes I'm uploading it to my server
k
Actually no. According to Cohere, it does not do that. Check the multi lingual model they have, which is meant for these types of queries: https://docs.cohere.com/docs/multilingual-language-models
b
So there is no model right now for this kind of multilingual matching?
k
There's no open source model for that. If one shows up, we will add support.
b
Okay 👍
k
Microsoft produced a multilingual E5 just a couple of weeks back: https://huggingface.co/intfloat/multilingual-e5-small We have not looked into it yet though
But again, these might not do the translation bit.
b
Is this model trained for semantic search?
k
Yes all e5 models are
b
So in order to use it I have to convert it to ONNX etc., right?
k
Yup
b
Does multi_search with vector search support all the parameters that a normal search supports (_eval, sort_by etc.)?
k
Yup 100%
b
Perfect!
How can I limit the results according to the vector_distance? For example, I have added 45 products and I get all 45 items in the results. I want to get only the most relevant ones, with a max distance of 1.
k
Copy code
vec:([0.3,0.4,0.5], distance_threshold:0.01)
b
The payload is { "searches": [ { "collection": "products", "q": "garden", "query_by": "embedding", "prefix": false, "per_page": 100 } ] }
k
Distance can be a value between 0 (perfect match) and 2 (worst match), so when given a distance_threshold we ignore records whose distance is greater than this threshold value.
b
Where to add vec input in the payload?
Or refer to the gist
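For reference, a hedged sketch of where vector_query sits in a multi_search payload (this assumes an externally-embedded vector field named vec whose num_dim matches the query vector length; the values are placeholders):
Copy code
curl "http://localhost:8108/multi_search" -X POST \
  -H "Content-Type: application/json" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
  "searches": [
    {
      "collection": "products",
      "q": "*",
      "vector_query": "vec:([0.3, 0.4, 0.5], distance_threshold: 0.01)",
      "per_page": 100
    }
  ]
}'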
b
I get this error -> "Field vec does not have a vector query index."
According to the new docs I create docs - > {"product_name": "ABCD","description": "This is some description text"}
products*
k
I don't follow. You can query by embedding field directly
If something doesn't work as expected please post a small reproducible example.
b
I just uploaded the model and the RAM usage is too high
Main PID: 2962 (typesense-serve) Tasks: 74 (limit: 2323) Memory: 807.3M
Main PID: 3218 (typesense-serve) Tasks: 74 (limit: 2323) Memory: 1.6G
Main PID: 5746 (typesense-serve) Tasks: 74 (limit: 2323) Memory: 1.3G
k
Yeah the multi lingual models are large
b
What's the recommended RAM size for multilingual?
k
Check the size of the model file in the hugging face repo.
b
I deployed in a droplet of 8 GB RAM and it's ok now
RAM at 2.2GB
I get this error -> "Field vec does not have a vector query index." when I add the vector search to the payload
k
Post the vec field definition in schema
b
type: float[] ?
k
You need to define num_dim as well to make it a vector field. Please refer to the docs.
b
Copy code
{
  "name": "vec",
  "type": "float[]",
  "num_dim": 4
}
Now I get this error -> "error":"Field vec has been declared in the schema, but is not found in the document." in collections/products/documents/import?action=create
k
Do you have that field in the documents ingested?
b
My Schema:
Copy code
{
  "name": "products",
  "fields": [
    {
      "name": "title",
      "type": "string",
      "locale": "sr"
    },
    {
      "name": "vec",
      "type": "float[]",
      "num_dim": 4
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "title"
        ],
        "model_config": {
          "model_name": "ts/paraphrase-multilingual-mpnet-base-v2"
        }
      }
    }
  ]
}
k
If you are looking to do auto embedding then you have to define the model config. For that, see the gist.
Hmm, that looks fine to me. AFK, will check when I return to keyboard
b
Yes I do auto embedding
In which gist are you referring to?
k
Your schema should look like this:
Copy code
{
  "name": "products",
  "fields": [
    {
      "name": "title",
      "type": "string",
      "locale": "sr"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "title"
        ],
        "model_config": {
          "model_name": "ts/paraphrase-multilingual-mpnet-base-v2"
        }
      }
    }
  ]
}
b
But there is no vec field in this schema
j
You’d need to define an explicit vector field with num_dim only if you’re generating embeddings outside of Typesense and importing them in
If you’re using auto generated embeddings, then just setting the embed key in the field definition will generate and store the embeddings
b
Yes
j
You can rename name: embedding to name: vec if you need
b
So the filter for search will be: 'vector_query' : 'embedding:([0.96826, 0.94, 0.39557, 0.306488], k:100)'
j
Yup
k
Actually the query tokens will automatically be vectorized as well. So you don't need to send vector_query at all.
Just do query_by=embedding
👍 1
b
Okay I’ll test it. Why is the RAM so high in this version? In 0.24.1 the average RAM is 60-100 MB, but in this one I need a minimum 8 GB instance and the average RAM is 2.2 GB. If I delete the other models, will it be lower?
@Kishore Nallan if I don’t send vector query I can’t limit the results. I get all the products
k
The embedding model must be held in memory
Can you try sending an empty array?
By limit, you mean using the vector_distance to pick only similar values right?
b
Yes
I want to get only lower than 0.7 distance
k
Check if [ ] works, otherwise I will have to address this use case which we might not have accounted for.
b
Ok
I tested the "vector_query": "embedding:([0.3,0.4,0.5], distance_threshold:0.01)", but I get this error -> "error": "Query field
embedding
must have 768 dimensions.". Model: paraphrase-multilingual-mpnet-base-v2
j
Could you try "vector_query": "embedding:([ ], distance_threshold:0.01)"
b
"error": "When a vector query value is empty, an
id
parameter must be present."
j
Ah ok, yeah we didn’t account for this use-case of setting vector query params like distance_threshold when used with auto-embedding. Could you create a GitHub issue using this template with a set of curl commands that replicates the issue?
b
Yes I'll create it
🙏 1
The only way to limit the results is only by using vector_query?
j
When you use auto-embedding - correct
b
Yes with auto embedding
j
per_page
and
page
might work, but it’s not going to go off of vector distance
b
If I don't limit the results, I get all docs sorted by vec distance
j
There is also a k parameter in vector_query, but that runs into the same issue - we need to add support for both
b
Yes I read about the k parameter
Another quick way to fix this issue is to remove items with vec distance > 0.7 (for example) on the frontend
j
Yeah that would work as a workaround
But you bring up a good use-case, we definitely want to add support for k and distance_threshold
b
Sure it will be very useful for auto embedding
Where should I create the issue?
j
b
👍 1
What's the best way to search for multiple terms? For example, I want to find recipes with: apples, oranges, coffee. I tried with "q": "apples, oranges, coffee" but the results are not good. (The vec distance for coffee recipes is > 0.7)
j
The only way to do precise searches like that would be to use filter_by instead of / or in combination with vector search
b
If I do a search with e.g. "q": "internet" I get all docs that are similar to this meaning. But if I do a search with q=* and filter_by="title:= internet OR title:= coffee OR title:= cars", I get only the results that have these terms in the title. It's like it doesn't use the model and it uses the normal search instead
1st code:
{
  "searches": [
    {
      "q": "internet",
      "collection": "products",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "prefix": false,
      "per_page": 250
    }
  ]
}
j
filter_by only does keyword-based filtering. Only the q parameter is used for vector searches
b
2nd code:
{
  "searches": [
    {
      "q": "*",
      "collection": "products",
      "filter_by": "title:= internet || title:= coffee || title:= cars",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "prefix": false,
      "per_page": 250
    }
  ]
}
So it doesn't work in my case
Is there any other way to find similarity in multiple terms?
using q input
j
“Similarity” is fully defined by the model
b
I mean docs near the terms
j
You could try breaking it out into 3 searches within a multi-search, one search per term (using the q parameter) and then aggregate the results client-side
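For example, a sketch of that multi_search payload (collection and field names taken from the earlier examples in this thread):
Copy code
{
  "searches": [
    { "collection": "products", "q": "apples", "query_by": "embedding", "exclude_fields": "embedding", "prefix": false },
    { "collection": "products", "q": "oranges", "query_by": "embedding", "exclude_fields": "embedding", "prefix": false },
    { "collection": "products", "q": "coffee", "query_by": "embedding", "exclude_fields": "embedding", "prefix": false }
  ]
}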
b
If i do a search with q="coffee, cars" I get the results but the vec distance of cars docs is wrong
It's higher than it should be
j
Typesense does not control that distance metric in the sense that the model is the one that takes search terms and projects them into vector space
So if it returns unexpected results, then the model is probably not a good fit for what you’re trying to do
b
Ok, i'll break it into 3 searches with multi search
I’m trying to create a suggestion system. For example, the user has searched for the terms Harry Potter, fast n furious, wolf of wall street. I store these values and on the next app launch the user should get suggestions (movies about magicians, about cars or about economics) based on these 3 previous searches. What would be the best way to do this with Typesense?
j
To do user-level personalization based on search history, you would have to use a recommender model - send it data as users interact with the site. Then when a user lands on the site again, embed the user profile to get the vectors, then do a nearest-neighbor search in Typesense to get recommended items to show them
b
Yes, that’s what I’m trying to do. I tried with multiple terms in the q input parameter as I wrote to you, but no luck. So the only way is by doing 3-5 multi searches with the stored values (e.g. Harry Potter etc). Can you do a nearest-neighbor search with multiple terms?
j
Yeah, you can separate them by spaces… but the entire sentence will be embedded together and not one-by-one
b
Yes, I tested it with spaces but the vec distance of the docs is wrong. It would be very useful to have more terms in q separated by a special character (e.g. ||)
k
@Bill I've a fix for the distance_threshold param in typesense/typesense:0.25.0.rc48. You can now pass it like this:
Copy code
'vector_query': 'vec:([], distance_threshold: 0.25)'
Please try this branch locally first or on a staging environment.
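Put together with the payload format used earlier in this thread, a search with the new threshold would look roughly like this (a sketch; names and values come from the earlier examples):
Copy code
{
  "searches": [
    {
      "collection": "products",
      "q": "garden",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "vector_query": "embedding:([], distance_threshold: 0.25)",
      "prefix": false,
      "per_page": 100
    }
  ]
}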
b
Yes it works now using 'vector_query': 'embedding:([], distance_threshold: 0.25)'. Is it production ready?
k
We are mostly on code freeze on this branch now, just ironing out the last few blockers or bugs like this. So you can use it.
b
Will there be breaking changes to these parts (auto embedding, vector query etc.) in the stable version of 0.25.0?
k
No, the API is frozen, barring these little usability quirks we find.
👍 1
b
Is it better to use auto embedding on multiple inputs, e.g. title and description, or only on one (title)? I’m asking because if the description has no info related to the title, the distance will be larger.
k
We currently support querying against only 1 embedding so you have to try and represent the product in a single piece of text which can be embedded.
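One hedged way to do that, assuming embed.from accepts multiple source fields (the array syntax in the earlier schema suggests it does), is to feed both fields into a single auto-embedded field:
Copy code
{
  "name": "embedding",
  "type": "float[]",
  "embed": {
    "from": ["title", "description"],
    "model_config": {
      "model_name": "ts/paraphrase-multilingual-mpnet-base-v2"
    }
  }
}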
b
I noticed another model issue. I use the model paraphrase-multilingual-mpnet-base-v2 and in languages like Greek, if I search with q: "Προγραμματιστής", I get the doc with title "Προγραμματιστής" but the vector distance is 3.4724921e-733. If I search with q: "προγραμματιστής" the distance is correct (0.002341...)
k
The models are basically a blackbox. Unless there is a systemic issue which maps all words wrongly, it's likely a quirk of the model. Non-English models are probably not that great.
b
So in this case the best solution is to lowercase the search term?
Is this model supported by typesense -> https://huggingface.co/intfloat/multilingual-e5-large/tree/main? In onnx folder there are these files: config.json, model.onnx, model.onnx_data, sentencepiece.bpe.model
k
It's not on our huggingface remote yet. But if you can convert the model to ONNX and put the files on disk locally, you should be able to use it.
b
It is converted in onnx format. But it has model.onnx, model.onnx_data
Should I edit the config file? It doesn't match the config files in typesense folder
k
I'll have to look. It doesn't seem to conform to the conventions of existing models.
b
The config in typesense file is: { "model_md5": "728d3db98e1b7a691a731644867382c5", "vocab_file_name": "sentencepiece.bpe.model", "vocab_md5": "bf25eb5120ad92ef5c7d8596b5dc4046", "model_type": "xlm_roberta" }
And in this files is: { "_name_or_path": "intfloat/multilingual-e5-large", "architectures": [ "XLMRobertaModel" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "classifier_dropout": null, "eos_token_id": 2, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-05, "max_position_embeddings": 514, "model_type": "xlm-roberta", "num_attention_heads": 16, "num_hidden_layers": 24, "output_past": true, "pad_token_id": 1, "position_embedding_type": "absolute", "transformers_version": "4.30.2", "type_vocab_size": 1, "use_cache": true, "vocab_size": 250002 }
I tried to convert it to: { "vocab_file_name": "sentencepiece.bpe.model", "model_type": "xlm_roberta" } But I get this error -> "message": "Failed to download model file"
k
I'll get back to you tomorrow after investigation
b
With schema: { "name": "productsNew", "fields": [ { "name": "product_name", "type": "string" }, { "name": "embedding", "type": "float[]", "embed": { "from": [ "product_name" ], "model_config": { "model_name": "ts/multilingual-e5-large" } } } ] }
Ok
k
The onnx file in the model is setup differently from what we have from other models, so need to spend some time to see how to integrate this. We're currently tied up with the last stretch of tasks for the 0.25 release. We will only be able to look into this after that, perhaps in 10-12 days.
b
Ok, I’ve read that the most precise free NLP multilingual model right now is the paraphrase one that you have already uploaded to the typesense folder, so for now it’s ok. The only issue it has is with capitalization, but I correct it on the client side
👍 1
I created a new collection and I get this error in all nodes: "Not Ready or Lagging" and at random times 502: Bad gateway
k
Are you indexing?
If writes exceed the write lag threshold server returns 502
b
Yes, I uploaded some docs. I have deployed it on a 2 CPU - 4 GB RAM droplet and when I created the collection with the embedding model, after 10s I got a 503
It's like the nodes are not synced
k
Cluster or single node? Which version of Typesense?
b
Using /debug I get random state: 4 and state: 0
cluster
v0.25.0.rc48
k
Difficult to say what's happening without looking at logs
b
Is there any way to reset the nodes or only by fresh install?
I tried to have only 1 node but then the state is 4 instead of 1
I also tried to Re-elect Leader but the state is still 4
k
There's a reset peers api. You can try that. It will force the cluster to peer again
b
What's the API endpoint?
k
POST /operations/reset_peers
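For example (assuming the default port and an admin API key):
Copy code
curl "http://localhost:8108/operations/reset_peers" -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"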
b
It's still state: 4
How can I do a fresh install?
Delete all typesense's log folders?
k
You have to stop all nodes and delete the data directory and start back.
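A hedged sketch of that, assuming a systemd/DEB install with the default data directory /var/lib/typesense (this wipes all data on the node, so back up anything you need first):
Copy code
# Run on every node: stop the server, wipe the data dir, start again
sudo systemctl stop typesense-server
sudo rm -rf /var/lib/typesense/*
sudo systemctl start typesense-server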
b
data dir and logs?
I deleted var/logs/typesense, var/lib/typesense and the state is still 4
and I also ran sudo apt remove typesense-server
k
Hard to help further without looking at the logs
b
Ok, fixed it. The issue was with the RAM during indexing. The model needs at least 8 GB to create the collection; then it can be resized to 4 GB
k
Ok glad to hear.
Which model is this?
b
I use the paraphrase
How can I use "sort_by": "vector_distance:asc, timestamp:desc"? I get this error: "Could not find a field named vector_distance in the schema for sorting"
If I have sort_by: timestamp:desc, I get results sorted only by timestamp and not by vector_distance
k
Try _vector_distance:desc
🙌 1
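For example, the search object could look roughly like this (asc puts the closest matches first, since a distance of 0 is a perfect match; field names are taken from earlier messages):
Copy code
{
  "collection": "products",
  "q": "internet",
  "query_by": "embedding",
  "exclude_fields": "embedding",
  "sort_by": "_vector_distance:asc, timestamp:desc"
}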
b
@Kishore Nallan I noticed another issue with build 0.25.0.rc53. When the server reboots, the typesense server auto starts but the collection with embedding fields is not loaded automatically. The collection will be loaded into memory only when there is a new request (multi_search with query_by embedding). As a result the first request is always error 500. { "results": [ { "code": 500, "error": "Request timed out." } ] }
Also, I tested it on 2 different droplet sizes: 1) 4 CPU - 8 GB RAM, 2) 2 CPU - 4 GB RAM. On the 1st configuration the memory consumption is at 1.1 GB, on the 2nd configuration the memory is at 2.2 GB.
@Kishore Nallan @Jason Bosco any idea?
k
Are you saying that on restart, once the first search request hits the collection, that's when the collection is actually loaded?
b
Yes, that’s happening only if the collection has an embedding field. On restart, the instance loads only the Typesense server with the other collections and the RAM is about 60-70 MB. When I search with multi_search query_by embedding, I get error 500 on the first request, and then, once the collection has been loaded, the RAM is at 1.6 GB. So it seems like it loads the model only after the 1st query
k
Super odd. Will take a look and get back to you. Late here now so, tomorrow.
b
Ok
k
@Bill I'm not able to reproduce this. Here's what I did:
1. Created a collection using the e5-small model and an auto-embedding field.
2. Indexed 100K documents into the collection.
3. Once the import was done, I confirmed the collection document count and also noted the memory usage via the /metrics.json end-point.
4. Stopped and started Typesense server.
5. I'm hitting the metrics end-point and the documents are getting indexed and memory usage is increasing.
b
@Kishore Nallan I created a collection using the paraphrase model. All the other models (e5-small etc.) are small in size (~500 MB), so that's why you don't get a 500 error on the first request. The paraphrase model is >1 GB. Could you reproduce with paraphrase-multilingual-mpnet-base-v2?
k
Just to re check: are you saying that if you restarted the server and just let it run for 30 mins, and you check the memory usage after that, it never loads the data or the model back at all?
Base line memory usage on a fresh instance is about 40-50MB
b
When you start the server with the collection (using the paraphrase model), the RAM consumption is about 40-50 MB. That is normal
When you send the first request to the collection using multi_search with embedding, it hasn't loaded the model yet (that's why the RAM is about 40-50 MB at first), and it loads the model then. Because the model is ~1.5 GB, it needs more time to load into RAM, so you get an error 500 timeout.
So in other words, the collection with the model isn't loaded on server start; that's why the RAM is 40-60 MB at first instead of the 1.6 GB it should be with the model loaded in RAM
The model will be loaded only at the first multi search request
k
Got it, let me try with that model in a bit. Though technically the code path is common for all models. Did you get the same result on multiple tries?
b
Yes I tested it multiple times
k
Ok will get back to you
b
Ok
k
Tried, but unable to reproduce. 🤔
Can you try this? https://gist.github.com/kishorenc/d63d96eb173cff4e80a51d35b828967a It uses a single document, but the issue should be the same: if the document really isn't loaded on restart, it should not matter how many documents are indexed.
It's perhaps possible that there was some issue with Huggingface in fetching the model around the time you accessed?
b
In this example what’s the memory consumption before and after the search query?
k
Ok hold on I briefly reproduced it. But memory consumption didn't change before/after search query. I will try to narrow it down now.
b
The issue is that the Typesense server doesn’t start with 1.6 GB RAM (model loaded); it starts with 60 MB
k
What exact version of TS you are using now?
b
0.25.0.rc53
k
The low memory wasn't an issue I saw. Anyway will post an update after further investigation.
b
Ok
To reproduce better:
1) create a collection with the embedded paraphrase model
2) index some docs (e.g. 5)
3) restart the server
4) check RAM, it will be about 60 MB
5) do a search request with multi_search vector searching
6) you will get an error 500 timeout
7) check memory, now the memory will be 1.6 GB (loaded model)
8) do the same search query and it works now
So the issue is that on auto start it doesn’t load the model
k
At Step 4) I am already seeing memory increasing to 1.8 GB. I also see this in the logs very soon after restart:
Copy code
Loading model from disk: /tmp/data/models/paraphrase-multilingual-mpnet-base-v2/model.onnx
This is the step that increases the memory usage. Do you see this log before making a search request?
If not, can you share the logs you get for the first 2 minutes after a restart?
b
Do you use the version rc53?
k
Yes
b
Ok I’ll test it again and I’ll send you the logs
I have deployed it in a 2cpu - 4 gb Ram. Is this an issue?
k
Copy code
I20230721 21:02:24.300542 130060 typesense_server_utils.cpp:331] Starting Typesense 0.25.0.rc53
...
I20230721 21:02:39.070822 130064 text_embedder.cpp:21] Loading model from disk: /tmp/data/models/paraphrase-multilingual-mpnet-base-v2/model.onnx
I'm trying this locally. See how the "loading model" log happens within 30 seconds of starting the server.
I did briefly see the timed out issue though, but only when I had more docs indexed. I will still chase that down, maybe it's related.
I don't think the 4 GB RAM is an issue. Will update what I find.
b
I checked the logs, there is no “Loading model from disk: /var/lib/typesense/models/paraphrase-multilingual-mpnet-base-v2/model.onnx”
Only appears after a search request
I20230721 15:36:01.550168 26330 typesense_server_utils.cpp:331] Starting Typesense 0.25.0.rc53
I20230721 15:36:01.550246 26330 typesense_server_utils.cpp:334] Typesense is using jemalloc.
I20230721 15:36:01.550557 26330 typesense_server_utils.cpp:384] Thread pool size: 16
I20230721 15:36:01.553516 26330 store.h:64] Initializing DB by opening state dir: /var/lib/typesense/db
I20230721 15:36:01.571373 26330 store.h:64] Initializing DB by opening state dir: /var/lib/typesense/meta
..........
I20230721 15:36:01.631971 26469 raft_server.cpp:508] Loading collections from disk...
.....
I20230721 15:36:01.913641 26469 collection_manager.cpp:301] Loaded 2 collection(s).
I20230721 15:36:01.913944 26469 collection_manager.cpp:305] Initializing batched indexer from snapshot state...
I20230721 15:36:01.913995 26469 batched_indexer.cpp:446] Restored 0 in-flight requests from snapshot.
I20230721 15:36:01.914005 26469 raft_server.cpp:515] Finished loading collections from disk.
W20230721 15:36:01.914573 26460 raft_server.cpp:591] Multi-node with no leader: refusing to reset peers.
I20230721 15:36:01.983656 26470 raft_server.h:288] Node starts following { leader_id=1.112.0.2:8107:8108, term=74, status=Follower receives message from new leader with the same term.}
I20230721 15:36:11.920372 26460 raft_server.cpp:564] Term: 74, last_index index: 42907, committed_index: 42907, known_applied_index: 42907, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 113332
k
Ah, looks like it is loading collections from disk, so it's probably restoring off a snapshot. The snapshot runs every hour, which persists the DB and compacts it. So perhaps it's that code path which triggers this behavior. I will try that too.
b
after the multi search request I get as response:
{
"results": [
{
"code": 500,
"error": "Request timed out."
}
]
}
k
Ok
b
After the request I get this in logs:
I20230721 15:52:01.187388 26367 text_embedder.cpp:21] Loading model from disk: /var/lib/typesense/models/paraphrase-multilingual-mpnet-base-v2/model.onnx
So what's the issue?
k
Found the issue and fixing it. Will share a build later today.
b
Ok
k
@Bill I've published 0.25.0.rc54
b
@Kishore Nallan Ok, I'll deploy it to my server
👍 1
I've found another bug with version 0.25.0.rc53
k
Shoot 😄
b
My payload in multi_search { "searches": [ { "collection": "books", "q": "sea", "query_by": "title, embedding", "exclude_fields": "embedding", "prefix": false, "vector_query": "embedding:([], distance_threshold: 0.30)", "per_page": "250", "sort_by": "location(0.0,0.0):asc", "page": 1 } ] }
When you add sort_by location, the results are not sorted by geo_distance_meters asc.
The first item I get has "geo_distance_meters": { "location": 60098 } and the 2nd has "geo_distance_meters": { "location": 5315 }, which is wrong because the 2nd item should be first
If you do a normal search without embedding ("query_by": "title") and without vector_query, the results are correct
k
Is this an issue only with the geo field, or is any sort_by neglected?
b
I tested it only with geo right now. I'll check with other values too.
Tested it with other values too (sort_by: datePublishedTimestamp). I think that when you use sort_by and query_by: embedding it doesn't work.
k
Got it, I will look.
👍 1
@Bill I'm not able to actually reproduce it. Would you able to share a small snippet? Here's what I tried:
Copy code
http://localhost:8108/collections/docs/documents/search?q=the&query_by=title,embedding&x-typesense-api-key=abcd&sort_by=points:desc&include_fields=points&vector_query=embedding:([], distance_threshold:0.30)
It's returning me hits sorted descending by points accurately.
Perhaps it's only related to geo searches? Will be good to get a re-confirmation on that.
b
@Kishore Nallan did you try with geo search?
k
Yes just managed to reproduce with geopoint. Will post a fix later. Thanks.
No geo fields seem okay. Issue seems to be only with sort by on geo
b
Ok, i'll check again with sortby timestamp
Yes geo fields are ok, the geo search doesn't work
k
Geo search meaning filter on geo fields? Post the payload please.
b
The payload is: { "searches": [ { "collection": "books", "q": "sea", "query_by": "title, embedding", "exclude_fields": "embedding", "prefix": false, "vector_query": "embedding:([], distance_threshold: 0.30)", "per_page": "250", "sort_by": "location(0.0,0.0):asc", "page": 1 } ] }
The results are not sorted as I wrote above. I get the following: "geo_distance_meters": { "location": 60098 }, and the 2nd "geo_distance_meters": { "location": 5315 },
k
Yes that's what I was able to reproduce. Sort by on geo field.
Sorry typo above. I meant NON geo fields are okay.
Typed as "no geo fields" 😞
Will share a fixed build later today.
b
Ok
The issue with sort_by datePublishedTimestamp (the default_sorting_field) is more complex. If I do sort_by: datePublishedTimestamp:asc or sort_by: datePublishedTimestamp:desc I get the same results in the same order; it's like it doesn't honor the "asc" or "desc". BUT if I use sort_by on a field that I hadn't set as the default_sorting_field, it works as expected.
Could you try to reproduce it using sort_by on the default_sorting_field with query_by: e.g. title, embedding?
k
Ok will check that
👍 1
Fixed the issue with geo query, but still looking into the other. Will share a build once that's done too.
b
Ok, I'll wait for the build
k
Ok please try with 0.25.0.rc56
I could not reproduce the issue with default sorting field, but I wonder if that's because it's fixed by the other change.
b
@Kishore Nallan I upgraded the server to version rc56 and it works now! All issues fixed: 1) collection with embedding auto-loaded, 2) sort_by location geo, 3) sort_by other value. 🙌
k
Awesome thank you for the feedback and help
👍 1
b
If I find anything else I’ll inform you
👍 1
@Kishore Nallan I did some load tests and the CPU requirements are too high. I have created a basic collection (with embedding - vector search) and I did a search using multi_search. The results are:
1. 2 vCPU - 4 GB RAM -> total max concurrent reqs/sec: 25
2. 4 vCPU - 8 GB RAM -> total max concurrent reqs/sec: 50
All these tests had 100% on all CPU cores. If you do a normal search without embedding, even with the 1 GB RAM - 1 vCPU droplet I can have 50 concurrent reqs. Is vector search possible at scale without a GPU?
k
Most of these models have 30-100 million parameters. So they are going to be pretty intensive unfortunately.
b
Have you tested it with GPU support? Could it handle more concurrent requests?
k
Yes, the GPU version is significantly faster.
However we have benchmarked only imports, and not queries.
b
Ok so for now the best way for scaling is multiple nodes (5) with multiple vCPUs, right?
k
Yes. We do have plans to improve this in future as quantized models become more available.
👍 1