#community-help

Optimum Cluster for 1M Documents with OpenAI Embedding

TLDR Denny asked which cluster configuration suits 1M documents with OpenAI embeddings. Jason pointed to the RAM formula for sizing, explained how to calculate record size, and clarified what determines embedding generation speed and when an upsert triggers a call to OpenAI.

Sep 01, 2023
Denny
12:37 AM
Hi - If I have about 1M documents (using embeddings with OpenAI), what cluster should I be using? Approx. 2 requests per second, or ~160k per day.
Jason
12:38 AM
The vector search section here gives you a formula for calculating RAM usage: https://typesense.org/docs/guide/system-requirements.html#choosing-ram
Denny
12:43 AM
thanks for checking!

Denny
01:17 AM
how do i check the size of each record?
Jason
01:18 AM
For keyword search: https://typesense.helpscoutdocs.com/article/161-how-do-i-calculate-average-record-size

For vector search, it's just the number of dimensions * 7 bytes per record.
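
To make the vector formula concrete, here is a rough sketch of the arithmetic for the 1M-document case in this thread. The 1536-dimension figure is an assumption (it matches OpenAI's text-embedding-ada-002; substitute your model's actual dimension count), and the function name is made up for illustration. The result covers vector storage only, not the keyword-search fields.

```python
BYTES_PER_DIMENSION = 7  # per-dimension figure quoted in this thread


def vector_ram_bytes(num_records: int, num_dimensions: int) -> int:
    """Approximate RAM needed to hold the vectors alone."""
    return num_records * num_dimensions * BYTES_PER_DIMENSION


if __name__ == "__main__":
    # Assumption: 1M records with 1536-dimensional embeddings.
    total = vector_ram_bytes(1_000_000, 1536)
    print(f"{total / 1e9:.2f} GB")  # prints "10.75 GB"
```

Treat the output as a lower bound for sizing and add the keyword-search estimate from the article linked above on top of it.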
Denny
01:33 AM
Will increasing the size of my cluster make embedding generation faster?
Jason
02:49 AM
If you’re using OpenAI, embedding generation speed is completely dependent on OpenAI API response times, which I’ve seen take a second or two for each API call.

If you’re using a built-in model, then enabling GPU Acceleration will speed up the embedding generation process: https://typesense.helpscoutdocs.com/article/174-gpu-acceleration
Denny
02:57 AM
Ah I see… I’ve noticed the syncing is a lot slower due to this.
Denny
02:57 AM
When I update a document using upsert, will it trigger a call to OpenAI?
Jason
02:58 AM
Only if the values of the fields listed under embed.from in the schema change.
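
As an illustration of that rule, below is a minimal sketch of a schema with an auto-embedding field, plus a small helper that mirrors the condition described above. The collection name, the field names (`title`, `description`, `views`) and both helper functions are hypothetical; the helper is not Typesense's implementation, only a client-side way to reason about which updates would cause re-embedding.

```python
# Hypothetical collection schema; names and the API key are placeholders.
schema = {
    "name": "docs",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "description", "type": "string"},
        {"name": "views", "type": "int32"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["title", "description"],
                "model_config": {
                    "model_name": "openai/text-embedding-ada-002",
                    "api_key": "YOUR_OPENAI_API_KEY",
                },
            },
        },
    ],
}


def embed_source_fields(schema: dict) -> set:
    """Collect every field name referenced by an embed.from list."""
    sources = set()
    for field in schema["fields"]:
        sources.update(field.get("embed", {}).get("from", []))
    return sources


def touches_embed_source(old_doc: dict, new_doc: dict, schema: dict) -> bool:
    """True if the update changes a value that feeds the embedding."""
    return any(
        old_doc.get(name) != new_doc.get(name)
        for name in embed_source_fields(schema)
    )
```

Under this sketch, bumping only `views` leaves the embedding sources untouched, while editing `title` would be expected to trigger a new embedding call.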
Jason
02:59 AM
Btw, you want to make sure you use the import endpoint with as many as 1K documents per API call, so that the call to OpenAI is also a bulk embedding call.
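
A minimal sketch of that batching, assuming documents arrive as Python dicts. The helper names `batched` and `to_jsonl` and the sample documents are made up for illustration; the sketch only builds the JSONL request bodies and leaves the HTTP call to whichever client you use.

```python
import json
from typing import Iterable, Iterator, List


def batched(docs: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Yield lists of at most `size` documents, preserving order."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def to_jsonl(batch: List[dict]) -> str:
    """Serialize a batch as JSON Lines, the format the import endpoint accepts."""
    return "\n".join(json.dumps(doc) for doc in batch)


if __name__ == "__main__":
    # Hypothetical sample data.
    docs = ({"id": str(i), "title": f"Document {i}"} for i in range(2500))
    for batch in batched(docs):
        body = to_jsonl(batch)
        # Send `body` as POST /collections/<name>/documents/import?action=upsert
        print(len(batch), "documents in this import call")
```

With 1M documents this works out to 1,000 import calls instead of 1,000,000 single-document writes.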
