#community-help

Discussion on Performance and Scalability for Multiple Term Search

TLDR Bill asks about the best way to run multi-term searches in a recommendation system he developed. Kishore Nallan suggested using embeddings with a remote embedder, or storing and averaging vectors. Despite testing several suggested solutions, Bill continued to face performance issues, leaving the discussion about scalability and recommendation-system performance unresolved.

Nov 16, 2023 (2 weeks ago)
Bill
03:40 PM
Yes
03:41
Bill
03:41 PM
Is there any other way besides multi search?
Nov 17, 2023 (2 weeks ago)
Kishore Nallan
03:07 AM
You don't have to actually do 10 vector searches. You can store the vector for each book, and when people want recommendations, fetch those 10 vectors, average them, and do a single vector search with the averaged vector. This should produce decent recommendations: the averaging operation will approximate the individual searches + aggregation.
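A minimal sketch of that id -> embedding lookup and averaging flow (not from the thread; the books collection, embedding field, host, and API key are placeholder assumptions):

// Fetch the stored embedding for one favorite (the id -> embedding lookup).
const TS = "http://localhost:8108";
const HEADERS = { "X-TYPESENSE-API-KEY": "xyz", "Content-Type": "application/json" };

async function fetchEmbedding(id: string): Promise<number[]> {
  const res = await fetch(`${TS}/collections/books/documents/${id}`, { headers: HEADERS });
  return (await res.json()).embedding;
}

// Element-wise mean of equal-length vectors.
function averageVectors(vectors: number[][]): number[] {
  const avg = new Array(vectors[0].length).fill(0);
  for (const v of vectors) v.forEach((x, i) => (avg[i] += x / vectors.length));
  return avg;
}

// One vector search with the averaged vector instead of 10 separate searches.
async function recommend(favoriteIds: string[]) {
  const avg = averageVectors(await Promise.all(favoriteIds.map(fetchEmbedding)));
  const res = await fetch(`${TS}/multi_search`, {
    method: "POST",
    headers: HEADERS,
    body: JSON.stringify({
      searches: [{ collection: "books", q: "*", vector_query: `embedding:([${avg.join(",")}], k:10)` }],
    }),
  });
  return res.json();
}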
Bill
10:18 AM
Kishore Nallan Each user has their own favorites. These 10 are not the same for every user.
Kishore Nallan
10:18 AM
I know, but you can store the id -> embedding mapping and look them up, right?
Bill
10:20 AM
For example, we have a recommendation system for recipes. User A wants the following ingredients: egg, tomato, bacon. To get recipes with these ingredients, I do a multi search for each one. So if a user has 10 ingredients, that's a multi search looping 10 times.
10:21
Bill
10:21 AM
Kishore Nallan so when a new book is created, store the embedding and look it up. Does this require a GPU as well?
10:22
Bill
10:22 AM
In this case I won't have a dynamic embedding for the search; I search based on the stored embedding.
Kishore Nallan
10:22 AM
The averaging operation is still dynamic.
10:22
Kishore Nallan
10:22 AM
You probably don't need a GPU for the searches. But it will depend on the concurrency.
Bill
10:23 AM
Okay I’ll test it, thank you

10:27
Bill
10:27 AM
If I use vector search based on id, can I have hybrid search?
Kishore Nallan
10:33 AM
Yes. The docs should have an example.
Bill
10:36 AM
Ok, I checked it, but in my case it's not a solution because the user's variables are not static, so I can't store standard embeddings.
Kishore Nallan
10:37 AM
What are the user variables? Free text? I suggested this approach because earlier you said the user had a list of items.
Bill
10:38 AM
A user can have a dynamic list of items (recent searches)
Kishore Nallan
10:38 AM
Search queries?
Bill
10:38 AM
It’s free text that I store and then do a multi search
10:38
Bill
10:38 AM
Yes search queries
Kishore Nallan
10:39 AM
Ok then that won't work.
Bill
10:39 AM
My issue is that the multi search takes a lot of time for 10 loops.
Kishore Nallan
10:41 AM
You don't need to do the search 10 times though. You need to generate embeddings for all 10 items, average them, and then do a single hybrid search with that average vector.
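In request form, the single hybrid search could look roughly like this (a hedged sketch, not a confirmed recipe: recipes, title, and embedding are placeholder names, the vector is the app-side average, and whether an explicit vector combines with a keyword q this way should be checked against the docs):

{
  "searches": [
    {
      "collection": "recipes",
      "q": "egg tomato bacon",
      "query_by": "title,embedding",
      "vector_query": "embedding:([0.12, -0.53, ...], k:20)"
    }
  ]
}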
10:42
Kishore Nallan
10:42 AM
Generating embeddings for 10 items is still intensive without a GPU, just saying.
Bill
10:43 AM
I don’t use OpenAI or Google, I use auto-embedding, so how can I generate an embedding for the search term?
Kishore Nallan
10:43 AM
Not possible within Typesense presently. You have to do it on your application side.
Bill
10:44 AM
Okay so I have to generate embeddings for each search term (10 terms). What do you mean average them?
Kishore Nallan
10:46 AM
Take the i-th float value in each of the 10 embeddings and divide by 10.
10:46
Kishore Nallan
10:46 AM
I mean, sum and divide by 10
Bill
10:48 AM
What do you mean ith float value?
Kishore Nallan
10:53 AM
Given two vectors [a1, a2, a3] and [b1, b2, b3], the average vector is: [(a1+b1)/2, (a2+b2)/2, (a3+b3)/2]
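A tiny worked instance of that formula:

const a = [1, 2, 3];
const b = [3, 4, 5];
const avg = a.map((x, i) => (x + b[i]) / 2); // => [2, 3, 4]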

Bill
10:54 AM
Ok got it thank you
10:58
Bill
10:58 AM
Will it be possible in a future release to get just the embedding, without storing it in a document? That would be very helpful for dynamic search terms.
Kishore Nallan
11:01 AM
Please create an issue, we will add to our roadmap.
Bill
11:41 AM
Ok
01:27
Bill
01:27 PM
Kishore Nallan I summed up the embeddings and it works great! But I tested nearest-neighbor vector search on 2 different instances and I get the same response time. On instance A (2 vCPU, 4 GB RAM) with 60 concurrent requests I get an average response time of 1.33s, and on instance B (4 vCPU, 8 GB RAM) with 60 concurrent requests I get 1.29s. Is this right?
01:28
Bill
01:28 PM
Payload (combined embeddings for 10 search terms):

{
  "searches": [
    {
      "collection": "test",
      "q": "*",
      "per_page": 200,
      "include_fields": "test",
      "vector_query": "embedding:([-0.1110641211271286, 0.757041597738862, -0.098668213468045, 0.46491856407374144, 0.1574960257858038, ....... -1.0089502930641174], k:50)"
    }
  ]
}
01:31
Bill
01:31 PM
Does it use all cores?
Kishore Nallan
01:41 PM
Can you run htop on your server when the benchmark runs? That will tell us if CPU is saturated or not.
Bill
01:43 PM
All cores seem to be working at 37% max.
01:43
Bill
01:43 PM
Tested with 50 concurrent reqs and I got a 1.13s response time.
01:44
Bill
01:44 PM
I also tested on 2 vCPU with 50 concurrent reqs -> also 1.13s.
01:44
Bill
01:44 PM
typesense version 0.25.2.rc9
Kishore Nallan
01:44 PM
What is the time with a concurrency of, say, 3?
Bill
01:45 PM
avg=159.39ms min=118.87ms med=150.76ms max=395.87ms
Kishore Nallan
01:45 PM
How many vectors in index?
Bill
01:46 PM
Kishore Nallan
01:46 PM
per_page will override k btw, so you are fetching 200 records, which might be slow.
Bill
01:46 PM
ok i'll test it with 20
Kishore Nallan
01:46 PM
Set both to 20
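Applied to the earlier payload, that would look like this (vector elided as before):

{
  "searches": [
    {
      "collection": "test",
      "q": "*",
      "per_page": 20,
      "include_fields": "test",
      "vector_query": "embedding:([-0.1110641211271286, .......], k:20)"
    }
  ]
}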
Bill
01:48 PM
3 concurrent reqs: avg=148.65ms min=99.79ms med=139.81ms max=370.75ms
01:48
Bill
01:48 PM
50 concurrent reqs: avg=1.11s min=111.6ms med=795.87ms max=6.56s
01:49
Bill
01:49 PM
How is it possible that the 2 vCPU instance has the same response time as the 4 vCPU one?
05:32
Bill
05:32 PM
Any idea? Kishore Nallan Jason
Nov 18, 2023 (2 weeks ago)
Bill
01:09 PM
Tested it today with a 16 vCPU - 32 GB RAM CPU-optimized instance, and again I get 1.1 seconds for 50 concurrent requests. Does nearest-neighbor vector search use all of the instance's RAM and CPU?
Kishore Nallan
01:15 PM
I'm surprised that the additional CPU cores are not helping with lower latency at higher concurrency. This usually happens when the determining factor is memory bandwidth rather than CPU. For example, this is the same behaviour you see with LLM inference: memory access is the bottleneck, so more CPUs don't help. This is my hypothesis; I have to verify it independently.
Bill
01:28 PM
Kishore Nallan I have tested it with many different instances and all the latest versions of Typesense (0.25.2.rc, 0.26.0.rc), and I get the same response time.
Kishore Nallan
01:30 PM
I mean verifying the memory bandwidth part, which will require instrumentation.
01:31
Kishore Nallan
01:31 PM
One other thing you could try doing is to increase the default thread pool size. Try setting --thread-pool-size to 500 and check.
01:32
Kishore Nallan
01:32 PM
Default is 8 * the number of CPU cores.
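For reference, the flag is passed on the server command line, along these lines (data dir and API key are placeholders):

typesense-server --data-dir=/var/lib/typesense --api-key=xyz --thread-pool-size=500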
Bill
01:32 PM
What could be the issue here? I used htop and all CPU cores (16) were at 3-7% usage.
Kishore Nallan
02:25 PM
Memory bandwidth is shared across all CPUs.
02:25
Kishore Nallan
02:25 PM
Use multiple smaller instances rather than a few larger ones.
Bill
02:34 PM
What’s the solution? I don’t understand. Isn’t vector search CPU intensive?
Kishore Nallan
02:54 PM
All CPUs share the same memory bandwidth. Once that's saturated, CPUs will be waiting for data to be read from memory. So for throughput, you have to use more smaller machines rather than a few larger ones.
Bill
02:56 PM
So every instance can support only 50 concurrent requests even if it has more CPUs?
02:57
Bill
02:57 PM
I don’t think that this is scalable
Kishore Nallan
03:06 PM
I need to investigate before I can comment further.
03:07
Kishore Nallan
03:07 PM
How many records are you searching on?
Bill
03:07 PM
Ok, 250
Kishore Nallan
03:08 PM
250 docs in the collection being searched on?
Bill
03:14 PM
Yes
Kishore Nallan
03:27 PM
Please export those docs, collection schema and exact query you are using for benchmarking and email [email protected]

I'll have to look.
Bill
03:32 PM
Ok. It’s a basic collection with one field (title). Nothing special
06:34
Bill
06:34 PM
I set thread-pool-size=500 but nothing changed: 1.09 seconds response time.
06:46
Bill
06:46 PM
I sent you an email with the payload, collection schema, and docs to reproduce it. Subject of the email: Embeddings search issue with concurrent requests.
06:47
Bill
06:47 PM
For benchmarking I use k6 -> k6 run --vus 50 --duration 10s script.js
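A minimal sketch of what that script.js might contain (endpoint, API key, and vector are placeholders, not Bill's actual script):

import http from 'k6/http';

const payload = JSON.stringify({
  searches: [{
    collection: 'test',
    q: '*',
    per_page: 20,
    include_fields: 'test',
    // Placeholder vector; the real one is the averaged embedding.
    vector_query: 'embedding:([0.12, -0.53], k:20)',
  }],
});

// Each k6 virtual user repeatedly POSTs the same multi_search request.
export default function () {
  http.post('http://localhost:8108/multi_search', payload, {
    headers: { 'Content-Type': 'application/json', 'X-TYPESENSE-API-KEY': 'xyz' },
  });
}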
Nov 19, 2023 (1 week ago)
Jason
02:36 PM
Will take a look this coming week
Bill
02:38 PM
Jason Ok. This is very critical because it's not scalable. Can I have more than 5 nodes in Typesense? (As a temporary solution)
Jason
02:39 PM
You can, any odd number of nodes should work. But each added node will increase write latency, since that many more nodes need to ack each write.
Bill
02:42 PM
Ok, so it's not the best solution. Please check, in addition to this issue with concurrent requests, whether it's possible to just generate the embedding from the model without storing it in a document. For example, just generate the embedding for the search term: "ship". That way, an external service (e.g. Hugging Face) isn't needed just to generate embeddings.
Nov 22, 2023 (1 week ago)
Bill
12:21 PM
Is there any progress on this critical issue?
Kishore Nallan
12:58 PM
We've not had a chance to set up the benchmarking suite on AWS instances because we've been tied up with higher-priority items. However, I took some time to do a quick check locally using siege. Here's what I am getting:

{    "transactions":                   22439,
    "availability":                  100.00,
    "elapsed_time":                   19.42,
    "data_transferred":               17.44,
    "response_time":                0.04,
    "transaction_rate":             1155.46,
    "throughput":                    0.90,
    "concurrency":                   49.84,
    "successful_transactions":           22439,
    "failed_transactions":                   0,
    "longest_transaction":                0.14,
    "shortest_transaction":                0.01
}

Command used: https://gist.github.com/kishorenc/4be95e9242bd4b183d5cf74f7eae202c

Can you give this a shot locally so we have a quicker common reference point?
Bill
01:01 PM
In a local environment you can't really check response time. You should check it on specific instance sizes in order to verify whether the response time stays the same across different instances.
Kishore Nallan
01:02 PM
I know, but they won't vary 1000x
01:03
Kishore Nallan
01:03 PM
Also, with remote benchmarking there is the issue of networking overhead as well. I first want to rule out anything wrong with core vector search performance. For that, a local benchmark is a good starting point. Otherwise the issue is either networking or a bug in the benchmarking setup.
Bill
01:03 PM
I think the issue is how multi_search works. It runs queries in sequence instead of in parallel.
Kishore Nallan
01:04 PM
That's not a problem here: there's only one query in multi search.
Bill
01:04 PM
With 50 concurrent requests per second, it runs the requests in sequence.
Kishore Nallan
01:05 PM
No, it doesn't.
Bill
01:05 PM
How can it be possible that a 2 vCPU instance has the same response time as a 4 vCPU or even an 8 vCPU one?
Kishore Nallan
01:05 PM
The sequential part is searches within a single multi search request body. Independent HTTP requests are processed in parallel. When I run the siege benchmark I see all CPU cores shoot up to 100%. When I reduce concurrency to 10 requests, response time decreases.
01:06
Kishore Nallan
01:06 PM
> How can it be possible that a 2 vCPU instance has the same response time as a 4 vCPU or even an 8 vCPU one?
I don't know, but can you try with the siege command above? You can run it from another instance if you want.
Bill
01:06 PM
Yes, I also see the CPU cores at 100% with 50 concurrent requests.
01:07
Bill
01:07 PM
Have you tried it with more CPU cores?
01:07
Bill
01:07 PM
It's like it has a fixed response time at x requests
Kishore Nallan
01:07 PM
But at 50 concurrency, I'm getting 40 ms response times already
01:08
Kishore Nallan
01:08 PM
I urge you to try with siege and see what you get. At this point I suspect the benchmark harness. I just don't see how it can vary so widely if there is a problem.
Bill
01:08 PM
Ok I'll check it with siege. Did you try the benchmark with k6?
Kishore Nallan
01:09 PM
Not yet, our hands have been full with some high priority items the past couple of days.
Bill
01:09 PM
Ok
