# community-help
b
Hello, can I use vector searching for multiple word embeddings? E.g. I want to find all near results to (king, castle, river). Is this possible, or can I use only 1 word embedding?
k
You will send multiple words into the model and it will return a single embedding that encapsulates the meaning.
b
What's the best way to use word embeddings instead of OpenAI's? So in order to use it, I have to vectorize the terms (word embeddings) and then use that mapping to search for near results. Am I correct?
k
In Typesense 0.25 RC we automatically support good open source embedding models.
You have to just define an embedding field and configure it to be embedded based on other text fields and everything just works.
🙌 1
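For reference, a minimal sketch of what such an embedding field definition looks like (the model name here is only an illustration, following the ts/ prefix convention for built-in models shown later in this thread):
Copy code
{
  "name": "embedding",
  "type": "float[]",
  "embed": {
    "from": ["title"],
    "model_config": {
      "model_name": "ts/e5-small"
    }
  }
}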
b
@Kishore Nallan that’s perfect! When will this version be available?
j
We’re on feature freeze for 0.25 and plan to release it in the next few weeks, pending any additional issues you find. But you can already use this feature with the latest RC build:
0.25.0.rc47
b
Is this version stable for production?
j
We’ve fixed all known issues with built-in embedding models… The remote embedding models (OpenAI specifically) have issues we’re working on
b
I need only the built-in embedding models. Is there any other issue apart from OpenAI?
In addition, will GPU support be added in v0.25 or in the next one?
j
Is there any other issue apart from OpenAI?
None that have been reported. The build has been otherwise stable with other users using it
In addition, will GPU support be added in v0.25 or in the next one?
It’s already in 0.25
b
So gpu is supported right now? In 0.25.0.rc47
j
Correct
b
Perfect! Is the DEB package available?
j
There’s some additional setup required in terms of runtime dependencies for GPU support to work. Some caveats:
• Only Nvidia GPUs are supported
• You want to install CUDA and cuDNN following Nvidia’s instructions
• You also want to install ONNX Runtime (Linux - C/C++ row here and place that .so file in the same directory as the typesense binary)
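A rough sketch of that setup on a Linux box (the archive name, version and binary location are assumptions; follow Nvidia's and ONNX Runtime's own instructions for your system):
Copy code
# 1. Install CUDA and cuDNN following Nvidia's instructions for your distro.
# 2. Download the ONNX Runtime GPU build (the Linux C/C++ row) and copy its shared
#    library next to the typesense-server binary (assumed here to live in /usr/bin):
tar -xzf onnxruntime-linux-x64-gpu-1.14.1.tgz
sudo cp onnxruntime-linux-x64-gpu-1.14.1/lib/libonnxruntime*.so* /usr/bin/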
b
I need it for simple search, for example to give 1-3 terms and find near results. So in this case is GPU required?
j
No, you can just use CPU
GPU is useful when you have say more than 100K records and you want to speed up the embedding generation process during indexing for each of the records, when using the built-in models
b
My records will be max 10-15K. Is search performance affected by GPU or only by CPU?
j
GPU helps both during hybrid search (where given a search query, Typesense will generate embeddings for the search query and then do a nearest neighbor search) and also during indexing documents and creating embeddings for them using the built-in models.
But for 15K records, CPU should be sufficient in terms of speed. GPU might be expensive + unnecessary compared to the perf improvement it will give you
b
@Jason Bosco I deployed v0.25.0.rc46 on a test server but it doesn't work very well. I used both models (S-BERT & E5) but the results are wrong in other languages. If the document (product_name) is in English and the search term is in English too, it works great.
In other languages, such as Greek, the results are wrong. It's like it doesn't process the term.
j
Ah yeah, both S-BERT and E-5 are trained on English datasets
So they will not work in other languages
You want to look for a language-specific model…
Looks like there’s a GreekBERT model here: https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1 Could you give it a shot outside of Typesense and let me know how accurate it is?
b
Can I use custom models in typesense?
@Jason Bosco I tried the GreekBERT Model and the results are good. How can I use it in typesense?
j
We have a way to load custom models in Typesense… Need to document this. Let me get back to you in about 24 hours.
b
Ok 👍
b
Hi @Jason Bosco I downloaded the xlm_roberta model (https://huggingface.co/xlm-roberta-base) because it has more languages than just Greek. I downloaded the config.json file and the .onnx according to the docs that you sent me. In xlm_roberta there is no vocab file; is it required, or is it only for BERT models?
I tried to create a collection but I get this message -> "Invalid config file". I have uploaded the following files to var/lib/typesense/models/xlm_roberta: config.json, model.onnx, sentencepiece.bpe.model, tokenizer.json from the link I attached
k
Have you tried this multi lingual model? https://huggingface.co/typesense/models/tree/main/paraphrase-multilingual-mpnet-base-v2 It should work for Greek.
It's built upon xlm_roberta
b
Okay i'll test it
Is DistilBERT supported?
k
No we don't support that. Not all BERT-like models work automatically well for semantic search. They might work for classification or summarization.
b
Okay I'll deploy the model you sent me and i'll inform you
What's the issue with this model -> https://huggingface.co/xlm-roberta-base ?
k
To use paraphrase-multilingual-mpnet-base-v2, you just need to do this:
Copy code
curl -k "<http://localhost:8108/collections>" -X POST -H "Content-Type: application/json" \                                                                                                                  130 ↵
      -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
  "name": "titles",
  "fields": [
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "points",
      "type": "int32"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "title"
        ],
        "model_config": {
          "model_name": "ts/paraphrase-multilingual-mpnet-base-v2"
        }
      }
    }
  ]
}'
👍 1
xlm-roberta-base is a plain masked language model. These models must then be fine-tuned specifically for a task, like semantic search. They don't work well without that fine-tuning.
👍 1
That's based on distilbert
b
What model would be best in order to support words in multiple languages? For example, term: accountant -> I should get results for λογιστής (Greek), contador (Spanish) etc.
k
I'm not sure if the multi-lingual models project the words from different languages onto the same vector space.
Try with paraphrase-multilingual-mpnet-base-v2 --> it might do what you want.
b
The meanings of these 3 words, which are in different languages, should be in the same space
Yes I'm uploading it to my server
k
Actually no. According to Cohere, it does not do that. Check the multi lingual model they have, which is meant for these types of queries: https://docs.cohere.com/docs/multilingual-language-models
b
So there is no model right now for this kind of multilingual matching?
k
There's no open source model for that. If one shows up, we will add support.
b
Okay 👍
k
Microsoft produced a multilingual E5 just a couple of weeks back: https://huggingface.co/intfloat/multilingual-e5-small We have not looked into it yet though
But again, these might not do the translation bit.
b
Is this model trained for semantic search?
k
Yes all e5 models are
b
So in order to use it I have to convert it to ONNX etc., right?
k
Yup
b
Does multi_search with vector search support all the parameters that a normal search supports (_eval, sort_by etc.)?
k
Yup 100%
b
Perfect!
How can I limit the results according to the vector_distance? For example, I have added 45 products and I get all 45 items in the results. I want to get only the most relevant ones, with a max distance of 1.
k
Copy code
vec:([0.3,0.4,0.5], distance_threshold:0.01)
b
The payload is { "searches": [ { "collection": "products", "q": "garden", "query_by": "embedding", "prefix": false, "per_page": 100 } ] }
k
Distance can be a value between 0 (perfect match) and 2 (worst match), so when given a distance_threshold we ignore records whose distance is greater than this threshold value.
b
Where to add vec input in the payload?
Or refer to the gist
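For reference, a hedged sketch of where vector_query sits in a multi_search payload (this assumes an externally-embedded vector field named vec whose num_dim matches the query vector length; the values are placeholders):
Copy code
curl "http://localhost:8108/multi_search" -X POST \
  -H "Content-Type: application/json" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
  "searches": [
    {
      "collection": "products",
      "q": "*",
      "vector_query": "vec:([0.3, 0.4, 0.5], distance_threshold: 0.01)",
      "per_page": 100
    }
  ]
}'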
b
I get this error -> "Field vec does not have a vector query index."
According to the new docs I create docs - > {"product_name": "ABCD","description": "This is some description text"}
products*
k
I don't follow. You can query by embedding field directly
If something doesn't work as expected please post a small reproducible example.
b
I just uploaded the model and the RAM usage is too high
Main PID: 2962 (typesense-serve) Tasks: 74 (limit: 2323) Memory: 807.3M
Main PID: 3218 (typesense-serve) Tasks: 74 (limit: 2323) Memory: 1.6G
Main PID: 5746 (typesense-serve) Tasks: 74 (limit: 2323) Memory: 1.3G
k
Yeah the multi lingual models are large
b
What's the recommended RAM size for multilingual?
k
Check the size of the model file in the hugging face repo.
b
I deployed in a droplet of 8 GB RAM and it's ok now
RAM at 2.2GB
I get this error -> "Field vec does not have a vector query index." when I add the vector search to the payload
k
Post the vec field definition in schema
b
type: float[] ?
k
You need to define num_dim as well to make it a vector field. Please refer to the docs.
b
Copy code
{
  "name": "vec",
  "type": "float[]",
  "num_dim": 4
}
Now I get this error -> "error":"Field vec has been declared in the schema, but is not found in the document." in collections/products/documents/import?action=create
k
Do you have that field in the documents ingested?
b
My Schema:
Copy code
{
  "name": "products",
  "fields": [
    {
      "name": "title",
      "type": "string",
      "locale": "sr"
    },
    {
      "name": "vec",
      "type": "float[]",
      "num_dim": 4
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "title"
        ],
        "model_config": {
          "model_name": "ts/paraphrase-multilingual-mpnet-base-v2"
        }
      }
    }
  ]
}
k
If you are looking to do auto embedding then you have to define the model config. For that, see the gist.
Hmm, that looks fine to me. AFK, will check when I return to keyboard
b
Yes I do auto embedding
In which gist are you referring to?
k
Your schema should look like this:
Copy code
{
  "name": "products",
  "fields": [
    {
      "name": "title",
      "type": "string",
      "locale": "sr"
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "title"
        ],
        "model_config": {
          "model_name": "ts/paraphrase-multilingual-mpnet-base-v2"
        }
      }
    }
  ]
}
b
But there is no vec field in this schema
j
You’d need to define an explicit vector field with num_dim only if you’re generating embeddings outside of Typesense and importing them in
If you’re using auto generated embeddings, then just setting the embed key in the field definition will generate and store the embeddings
b
Yes
j
You can rename name: embedding to name: vec if you need
b
So the filter for search will be: 'vector_query' : 'embedding:([0.96826, 0.94, 0.39557, 0.306488], k:100)'
j
Yup
k
Actually the query tokens will automatically be vectorized as well. So you don't need to send vector_query at all.
Just do query_by=embedding
👍 1
b
Okay I’ll test it. Why is the RAM so high in this version? In 0.24.1 the average RAM is 60-100 MB, but in this one I need a minimum 8 GB instance and the average RAM is 2.2 GB. If I delete the other models, will it be lower?
@Kishore Nallan if I don’t send vector query I can’t limit the results. I get all the products
k
The embedding model must be held in memory
Can you try sending an empty array?
By limit, you mean using the vector_distance to pick only similar values right?
b
Yes
I want to get only lower than 0.7 distance
k
Check if [ ] works, otherwise I will have to address this use case which we might not have accounted for.
b
Ok
I tested the "vector_query": "embedding:([0.3,0.4,0.5], distance_threshold:0.01)", but I get this error -> "error": "Query field
embedding
must have 768 dimensions.". Model: paraphrase-multilingual-mpnet-base-v2
j
Could you try "vector_query": "embedding:([ ], distance_threshold:0.01)"
b
"error": "When a vector query value is empty, an
id
parameter must be present."
j
Ah ok, yeah we didn’t account for this use-case of setting vector query params like distance_threshold when used with auto-embedding. Could you create a GitHub issue using this template with a set of curl commands that replicates the issue?
b
Yes I'll create it
🙏 1
The only way to limit the results is only by using vector_query?
j
When you use auto-embedding - correct
b
Yes with auto embedding
j
per_page
and
page
might work, but it’s not going to go off of vector distance
b
If I don't limit the results, I get all docs sorted by vec distance
j
There is also a k parameter in vector_query, but that runs into the same issue - we need to add support for both
b
Yes I read about the k parameter
Another quick way to fix this issue is to remove items with vec distance > 0.7 (for example) on the frontend
j
Yeah that would work as a workaround
But you bring up a good use-case, we definitely want to add support for k and distance_threshold
b
Sure it will be very useful for auto embedding
Where should I create the issue?
j
b
👍 1
What's the best way to search for multiple terms? For example, I want to find recipes with: apples, oranges, coffee. I tried with "q": "apples, oranges, coffee" but the results are not good. (The vec distance for coffee recipes is > 0.7)
j
The only way to do precise searches like that would be to use filter_by instead of / or in combination with vector search
b
If I do a search with e.g. "q": "internet" I get all docs that are similar to this meaning. But if I do a search with q=* and filter_by="title:= internet OR title:= coffee OR title:= cars", I get only the results that have these terms in the title. It's like it doesn't use the model and it uses the normal search instead
1st code:
{
  "searches": [
    {
      "q": "internet",
      "collection": "products",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "prefix": false,
      "per_page": 250
    }
  ]
}
j
filter_by only does keyword-based filtering. Only the q parameter is used for vector searches
b
2nd code:
{
  "searches": [
    {
      "q": "*",
      "collection": "products",
      "filter_by": "title:= internet || title:= coffee || title:= cars",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "prefix": false,
      "per_page": 250
    }
  ]
}
So it doesn't work in my case
Is there any other way to find similarity in multiple terms?
using q input
j
“Similarity” is fully defined by the model
b
I mean docs near the terms
j
You could try breaking it out into 3 searches within a multi-search, one search per term (using the q parameter) and then aggregate the results client-side
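For example, a sketch of that multi_search payload (collection and field names taken from the earlier examples in this thread):
Copy code
{
  "searches": [
    { "collection": "products", "q": "apples", "query_by": "embedding", "exclude_fields": "embedding", "prefix": false },
    { "collection": "products", "q": "oranges", "query_by": "embedding", "exclude_fields": "embedding", "prefix": false },
    { "collection": "products", "q": "coffee", "query_by": "embedding", "exclude_fields": "embedding", "prefix": false }
  ]
}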
b
If i do a search with q="coffee, cars" I get the results but the vec distance of cars docs is wrong
It's higher than it should be
j
Typesense does not control that distance metric in the sense that the model is the one that takes search terms and projects them into vector space
So if it returns unexpected results, then the model is probably not a good fit for what you’re trying to do
b
Ok, i'll break it into 3 searches with multi search
I’m trying to create a suggestion system. For example, the user has searched for the terms Harry Potter, fast n furious, wolf of wall street. I store these values and on the next app launch the user should get suggestions (movies about magicians, about cars or about economics) based on these 3 previous searches. What would be the best way to do this with Typesense?
j
To do user-level personalization based on search history, you would have to use a recommender model - send it data as users interact with the site. Then when a user lands on the site again, embed the user profile to get the vectors, then do a nearest-neighbor search in Typesense to get recommended items to show them
b
Yes, that’s what I’m trying to do. I tried with multiple terms in the q input parameter as I wrote to you, but no luck. So the only way is by doing 3-5 multi searches with the stored values (e.g. Harry Potter etc). Can you do a nearest-neighbor search with multiple terms?
j
Yeah, you can separate them by spaces… but the entire sentence will be embedded together and not one-by-one
b
Yes, I tested it with spaces but the vec distance of the docs is wrong. It would be very useful to have more terms in q separated by a special character (e.g. ||)
k
@Bill I've a fix for the distance_threshold param in typesense/typesense:0.25.0.rc48. You can now pass it like this:
Copy code
'vector_query': 'vec:([], distance_threshold: 0.25)'
Please try this branch locally first or on a staging environment.
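Put together with the payload format used earlier in this thread, a search with the new threshold would look roughly like this (a sketch; names and values come from the earlier examples):
Copy code
{
  "searches": [
    {
      "collection": "products",
      "q": "garden",
      "query_by": "embedding",
      "exclude_fields": "embedding",
      "vector_query": "embedding:([], distance_threshold: 0.25)",
      "prefix": false,
      "per_page": 100
    }
  ]
}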
b
Yes it works now using 'vector_query': 'embedding:([], distance_threshold: 0.25)'. Is it production ready?
k
We are mostly on code freeze on this branch now, just ironing out the last few blockers or bugs like this. So you can use it.
b
Will there be breaking changes to these parts (auto embedding, vector query etc.) in the stable version of 0.25.0?
k
No, the API is frozen, barring these little usability quirks we find.
👍 1
b
Is it better to use auto embedding on multiple inputs, e.g. title and description, or only on one (title)? I’m asking because if the description has no info related to the title, the distance will be larger.
k
We currently support querying against only 1 embedding so you have to try and represent the product in a single piece of text which can be embedded.
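One hedged way to do that, assuming embed.from accepts multiple source fields (the array syntax in the earlier schema suggests it does), is to feed both fields into a single auto-embedded field:
Copy code
{
  "name": "embedding",
  "type": "float[]",
  "embed": {
    "from": ["title", "description"],
    "model_config": {
      "model_name": "ts/paraphrase-multilingual-mpnet-base-v2"
    }
  }
}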
b
I noticed another model issue. I use the model paraphrase-multilingual-mpnet-base-v2 and in languages like Greek, if I search with q: "Προγραμματιστής", I get the doc with title "Προγραμματιστής" but the vector distance is 3.4724921e-733. If I search with q: "προγραμματιστής" the distance is correct (0.002341...)
k
The models are basically a blackbox. Unless there is a systemic issue which maps all words wrongly, it's likely a quirk of the model. Non-English models are probably not that great.
b
So in this case the best solution is to lowercase the search term?
Is this model supported by typesense -> https://huggingface.co/intfloat/multilingual-e5-large/tree/main? In onnx folder there are these files: config.json, model.onnx, model.onnx_data, sentencepiece.bpe.model
k
It's not on our huggingface remote yet. But if you can convert the model to ONNX and put the files on disk locally, you should be able to use it.
b
It is converted in onnx format. But it has model.onnx, model.onnx_data
Should I edit the config file? It doesn't match the config files in typesense folder
k
I'll have to look. It doesn't seem to conform to the conventions of existing models.
b
The config in typesense file is: { "model_md5": "728d3db98e1b7a691a731644867382c5", "vocab_file_name": "sentencepiece.bpe.model", "vocab_md5": "bf25eb5120ad92ef5c7d8596b5dc4046", "model_type": "xlm_roberta" }
And in this files is: { "_name_or_path": "intfloat/multilingual-e5-large", "architectures": [ "XLMRobertaModel" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "classifier_dropout": null, "eos_token_id": 2, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-05, "max_position_embeddings": 514, "model_type": "xlm-roberta", "num_attention_heads": 16, "num_hidden_layers": 24, "output_past": true, "pad_token_id": 1, "position_embedding_type": "absolute", "transformers_version": "4.30.2", "type_vocab_size": 1, "use_cache": true, "vocab_size": 250002 }
I tried to convert it to: { "vocab_file_name": "sentencepiece.bpe.model", "model_type": "xlm_roberta" } But I get this error -> "message": "Failed to download model file"
k
I'll get back to you tomorrow after investigation
b
With schema: { "name": "productsNew", "fields": [ { "name": "product_name", "type": "string" }, { "name": "embedding", "type": "float[]", "embed": { "from": [ "product_name" ], "model_config": { "model_name": "ts/multilingual-e5-large" } } } ] }
Ok
k
The onnx file in the model is setup differently from what we have from other models, so need to spend some time to see how to integrate this. We're currently tied up with the last stretch of tasks for the 0.25 release. We will only be able to look into this after that, perhaps in 10-12 days.
b
Ok, I’ve read that the most precise free NLP multilingual model right now is the paraphrase one that you have already uploaded to the typesense folder, so for now it’s ok. The only issue it has is with capitalization, but I correct it on the client side
👍 1
I created a new collection and I get this error in all nodes: "Not Ready or Lagging" and at random times 502: Bad gateway
k
Are you indexing?
If writes exceed the write lag threshold server returns 502
b
Yes, I uploaded some docs. I have deployed it on a 2 CPU - 4 GB RAM droplet and when I created the collection with the embedding model, after 10s I got a 503
It's like the nodes are not synced
k
Cluster or single node? Which version of Typesense?
b
Using /debug I get random state: 4 and state: 0
cluster
v0.25.0.rc48
k
Difficult to say what's happening without looking at logs
b
Is there any way to reset the nodes or only by fresh install?
I tried to have only 1 node but then the state is 4 instead of 1
I also tried to Re-elect Leader but the state is still 4
k
There's a reset peers api. You can try that. It will force the cluster to peer again
b
What's the API endpoint?
k
POST /operations/reset_peers
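For example (assuming the default port and an admin API key):
Copy code
curl "http://localhost:8108/operations/reset_peers" -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"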
b
It's still state: 4
How can I do a fresh install?
Delete all typesense's log folders?
k
You have to stop all nodes and delete the data directory and start back.
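A hedged sketch of that, assuming a systemd/DEB install with the default data directory /var/lib/typesense (this wipes all data on the node, so back up anything you need first):
Copy code
# Run on every node: stop the server, wipe the data dir, start again
sudo systemctl stop typesense-server
sudo rm -rf /var/lib/typesense/*
sudo systemctl start typesense-server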
b
data dir and logs?
I deleted var/logs/typesense, var/lib/typesense and the state is still 4
and I also ran sudo apt remove typesense-server
k
Hard to help further without looking at the logs
b
Ok, fixed it. The issue was with the RAM during indexing. The model needs at least 8 GB to create the collection; then it can be resized to 4 GB
k
Ok glad to hear.
Which model is this?
b
I use the paraphrase
How can I use "sort_by": "vector_distance:asc, timestamp:desc"? I get this error: "Could not find a field named vector_distance in the schema for sorting"
If I have sort_by: timestamp:desc, I get results sorted only by timestamp and not by vector_distance
k
Try _vector_distance:desc
🙌 1
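For example, the search object could look roughly like this (asc puts the closest matches first, since a distance of 0 is a perfect match; field names are taken from earlier messages):
Copy code
{
  "collection": "products",
  "q": "internet",
  "query_by": "embedding",
  "exclude_fields": "embedding",
  "sort_by": "_vector_distance:asc, timestamp:desc"
}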
b
@Kishore Nallan I noticed another issue with build 0.25.0.rc53. When the server reboots, the typesense server auto starts but the collection with embedding fields is not loaded automatically. The collection will be loaded into memory only when there is a new request (multi_search with query_by embedding). As a result the first request is always error 500. { "results": [ { "code": 500, "error": "Request timed out." } ] }
Also, I tested it on 2 different droplet sizes: 1) 4 CPU - 8 GB RAM, 2) 2 CPU - 4 GB RAM. On the 1st configuration the memory consumption is at 1.1 GB, on the 2nd configuration the memory is at 2.2 GB.
@Kishore Nallan @Jason Bosco any idea?
k
Are you saying that on restart, once the first search request hits the collection, that's when the collection is actually loaded?
b
Yes, that’s happening only if the collection has an embedding field. On restart, the instance loads only the Typesense server with the other collections and the RAM is about 60-70 MB. When I search with multi_search query_by embedding, I get error 500 on the first request, and then, once the collection has been loaded, the RAM is at 1.6 GB. So it seems like it loads the model only after the 1st query
k
Super odd. Will take a look and get back to you. Late here now so, tomorrow.
b
Ok
k
@Bill I'm not able to reproduce this. Here's what I did:
1. Created a collection using the e5-small model and an auto-embedding field.
2. Indexed 100K documents into the collection.
3. Once the import was done, I confirmed the collection document count and also noted the memory usage via the /metrics.json end-point.
4. Stopped and started Typesense server.
5. I'm hitting the metrics end-point and the documents are getting indexed and memory usage is increasing.
b
@Kishore Nallan I created a collection using the paraphrase model. All the other models (e5-small etc.) are small in size (~500 MB), so that's why you don't get a 500 error on the first request. The paraphrase model is >1 GB. Could you reproduce with paraphrase-multilingual-mpnet-base-v2?
k
Just to re check: are you saying that if you restarted the server and just let it run for 30 mins, and you check the memory usage after that, it never loads the data or the model back at all?
Base line memory usage on a fresh instance is about 40-50MB
b
When you start the server with the collection (using the paraphrase model), the RAM consumption is about 40-50 MB. That is normal
When you send the first request to the collection using multi_search with embedding, it hasn't loaded the model yet (that's why the RAM is about 40-50 MB at first), and it loads the model then. Because the model is ~1.5 GB, it needs more time to load into RAM, so you get an error 500 timeout.
So in other words, the collection with the model isn't loaded on server start; that's why the RAM is 40-60 MB at first instead of the 1.6 GB it should be with the model loaded in RAM
The model will be loaded only at the first multi search request
k
Got it, let me try with that model in a bit. Though technically the code path is common for all models. Did you get the same result on multiple tries?
b
Yes I tested it multiple times
k
Ok will get back to you
b
Ok
k
Tried, but unable to reproduce. 🤔
Can you try this? https://gist.github.com/kishorenc/d63d96eb173cff4e80a51d35b828967a It uses a single document, but the issue should be the same: if the document really isn't loaded on restart, it should not matter how many documents are indexed.
It's perhaps possible that there was some issue with Huggingface in fetching the model around the time you accessed?
b
In this example what’s the memory consumption before and after the search query?
k
Ok hold on I briefly reproduced it. But memory consumption didn't change before/after search query. I will try to narrow it down now.
b
The issue is that the Typesense server doesn’t start with 1.6 GB RAM (model loaded); it starts with 60 MB
k
What exact version of TS you are using now?
b
0.25.0.rc53
k
The low memory wasn't an issue I saw. Anyway will post an update after further investigation.
b
Ok
To reproduce better:
1) create a collection with the embedded paraphrase model
2) index some docs (e.g. 5)
3) restart the server
4) check RAM, it will be about 60 MB
5) do a search request with multi_search vector searching
6) you will get an error 500 timeout
7) check memory, now the memory will be 1.6 GB (loaded model)
8) do the same search query and it works now
So the issue is that on auto start it doesn’t load the model
k
At Step 4) I am already seeing memory increasing to 1.8 GB. I also see this in the logs very soon after restart:
Copy code
Loading model from disk: /tmp/data/models/paraphrase-multilingual-mpnet-base-v2/model.onnx
This is the step that increases the memory usage. Do you see this log before making a search request?
If not, can you share the logs you get for the first 2 minutes after a restart?
b
Do you use the version rc53?
k
Yes
b
Ok I’ll test it again and I’ll send you the logs
I have deployed it in a 2cpu - 4 gb Ram. Is this an issue?
k
Copy code
I20230721 21:02:24.300542 130060 typesense_server_utils.cpp:331] Starting Typesense 0.25.0.rc53
...
I20230721 21:02:39.070822 130064 text_embedder.cpp:21] Loading model from disk: /tmp/data/models/paraphrase-multilingual-mpnet-base-v2/model.onnx
I'm trying this locally. See how the "loading model" log happens within 30 seconds of starting the server.
I did briefly see the timed out issue though, but only when I had more docs indexed. I will still chase that down, maybe it's related.
I don't think the 4 GB RAM is an issue. Will update what I find.
b
I checked the logs, there is no “Loading model from disk: /var/lib/typesense/models/paraphrase-multilingual-mpnet-base-v2/model.onnx”
Only appears after a search request
I20230721 15:36:01.550168 26330 typesense_server_utils.cpp:331] Starting Typesense 0.25.0.rc53
I20230721 15:36:01.550246 26330 typesense_server_utils.cpp:334] Typesense is using jemalloc.
I20230721 15:36:01.550557 26330 typesense_server_utils.cpp:384] Thread pool size: 16
I20230721 15:36:01.553516 26330 store.h:64] Initializing DB by opening state dir: /var/lib/typesense/db
I20230721 15:36:01.571373 26330 store.h:64] Initializing DB by opening state dir: /var/lib/typesense/meta
..........
I20230721 15:36:01.631971 26469 raft_server.cpp:508] Loading collections from disk...
.....
I20230721 15:36:01.913641 26469 collection_manager.cpp:301] Loaded 2 collection(s).
I20230721 15:36:01.913944 26469 collection_manager.cpp:305] Initializing batched indexer from snapshot state...
I20230721 15:36:01.913995 26469 batched_indexer.cpp:446] Restored 0 in-flight requests from snapshot.
I20230721 15:36:01.914005 26469 raft_server.cpp:515] Finished loading collections from disk.
W20230721 15:36:01.914573 26460 raft_server.cpp:591] Multi-node with no leader: refusing to reset peers.
I20230721 15:36:01.983656 26470 raft_server.h:288] Node starts following { leader_id=1.112.0.2:8107:8108, term=74, status=Follower receives message from new leader with the same term.}
I20230721 15:36:11.920372 26460 raft_server.cpp:564] Term: 74, last_index index: 42907, committed_index: 42907, known_applied_index: 42907, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 113332
k
Ah, looks like it is loading collections from disk, so it's probably restoring off a snapshot. The snapshot runs every hour, which persists the DB and compacts it. So perhaps it's that code path which triggers this behavior. I will try that too.
b
after the multi search request I get as response:
{
"results": [
{
"code": 500,
"error": "Request timed out."
}
]
}
k
Ok
b
After the request I get this in logs:
I20230721 15:52:01.187388 26367 text_embedder.cpp:21] Loading model from disk: /var/lib/typesense/models/paraphrase-multilingual-mpnet-base-v2/model.onnx
So what's the issue?
k
Found the issue and fixing it. Will share a build later today.
b
Ok
k
@Bill I've published 0.25.0.rc54
b
@Kishore Nallan Ok, I'll deploy it to my server
👍 1
I've found another bug with version 0.25.0.rc53
k
Shoot 😄
b
My payload in multi_search { "searches": [ { "collection": "books", "q": "sea", "query_by": "title, embedding", "exclude_fields": "embedding", "prefix": false, "vector_query": "embedding:([], distance_threshold: 0.30)", "per_page": "250", "sort_by": "location(0.0,0.0):asc", "page": 1 } ] }
When you add sort_by location, the results are not sorted by geo_distance_meters asc.
The first item I get has "geo_distance_meters": { "location": 60098 } and the 2nd has "geo_distance_meters": { "location": 5315 }, which is wrong because the 2nd item should be first
If you do a normal search without embedding ("query_by": "title") and without vector_query, the results are correct
k
Is this an issue only with the geo field, or is any sort_by neglected?
b
I tested it only with geo right now. I'll check with other values too.
Tested it with other values too (sort_by: datePublishedTimestamp). I think that when you use sort_by and query_by: embedding it doesn't work.
k
Got it, I will look.
👍 1
@Bill I'm not able to actually reproduce it. Would you able to share a small snippet? Here's what I tried:
Copy code
http://localhost:8108/collections/docs/documents/search?q=the&query_by=title,embedding&x-typesense-api-key=abcd&sort_by=points:desc&include_fields=points&vector_query=embedding:([], distance_threshold:0.30)
It's returning me hits sorted descending by points accurately.
Perhaps it's only related to geo searches? Will be good to get a re-confirmation on that.
b
@Kishore Nallan did you try with geo search?
k
Yes just managed to reproduce with geopoint. Will post a fix later. Thanks.
No geo fields seem okay. Issue seems to be only with sort by on geo
b
Ok, i'll check again with sortby timestamp
Yes geo fields are ok, the geo search doesn't work
k
Geo search meaning filter on geo fields? Post the payload please.
b
The payload is: { "searches": [ { "collection": "books", "q": "sea", "query_by": "title, embedding", "exclude_fields": "embedding", "prefix": false, "vector_query": "embedding:([], distance_threshold: 0.30)", "per_page": "250", "sort_by": "location(0.0,0.0):asc", "page": 1 } ] }
The results are not sorted as I wrote above. I get the following: "geo_distance_meters": { "location": 60098 }, and the 2nd "geo_distance_meters": { "location": 5315 },
k
Yes that's what I was able to reproduce. Sort by on geo field.
Sorry typo above. I meant NON geo fields are okay.
Typed as "no geo fields" 😞
Will share a fixed build later today.
b
Ok
The issue with sort_by datePublishedTimestamp (the default_sorting_field) is more complex. If I do sort_by: datePublishedTimestamp:asc or sort_by: datePublishedTimestamp:desc I get the same results in the same order; it's like it doesn't honor the "asc" or "desc". BUT if I use sort_by on a field that I hadn't set as the default_sorting_field, it works as expected.
Could you try to reproduce it using sort_by on the default_sorting_field with query_by: e.g. title, embedding?
k
Ok will check that
👍 1
Fixed the issue with geo query, but still looking into the other. Will share a build once that's done too.
b
Ok, I'll wait for the build
k
Ok please try with 0.25.0.rc56
I could not reproduce the issue with default sorting field, but I wonder if that's because it's fixed by the other change.
b
@Kishore Nallan I upgraded the server to version rc56 and it works now! All issues fixed: 1) collection with embedding auto-loaded, 2) sort_by location geo, 3) sort_by other value. 🙌
k
Awesome thank you for the feedback and help
👍 1
b
If I find anything else I’ll inform you
👍 1
@Kishore Nallan I did some load tests and the CPU requirements are too high. I have created a basic collection (with embedding - vector search) and I did a search using multi_search. The results are:
1. 2 vCPU - 4 GB RAM -> total max concurrent reqs/sec: 25
2. 4 vCPU - 8 GB RAM -> total max concurrent reqs/sec: 50
All these tests had 100% on all CPU cores. If you do a normal search without embedding, even with the 1 GB RAM - 1 vCPU droplet I can have 50 concurrent reqs. Is vector search possible at scale without a GPU?
k
Most of these models have 30-100 million parameters. So they are going to be pretty intensive unfortunately.
b
Have you tested it with GPU support? Could it handle more concurrent requests?
k
Yes, the GPU version is significantly faster.
However we have benchmarked only imports, and not queries.
b
Ok so for now the best way for scaling is multiple nodes (5) with multiple vCPUs, right?
k
Yes. We do have plans to improve this in future as quantized models become more available.
👍 1