#community-help

Optimizing Bulk Indexing and Reducing RAM Usage in Typesense

TLDR Timon experienced issues with Typesense becoming unresponsive during bulk indexing and sought advice. Jason recommended much larger import requests and a longer client-side timeout, and troubleshooting revealed that Docker's default RAM limit needed to be raised. Kishore Nallan undertook to find ways to optimize memory usage, particularly for geopoint indexing.

Dec 15, 2021 (23 months ago)
Timon 05:07 PM
What is the recommended approach to bulk indexing? I import in batches of 40 documents, and after some minutes Typesense becomes unresponsive and I have to clear the container volume. Do I have to throttle requests?
Jason 07:00 PM
Timon 40 documents per import request is actually pretty small. I'd recommend sending as many as 5K-10K documents per API call.
Jason 07:00 PM
You want to make sure that the client-side timeout is large enough for the API call to complete
Jason 07:01 PM
I've even tried up to 2.2M documents per API call
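(For reference, a minimal sketch of what that advice can look like with the official typesense-python client. The node address, API key, timeout value, and the import_documents helper are illustrative, not from the thread.)

```python
import typesense

# Client with a generous connection timeout so that large import
# calls have time to complete (the default is only a few seconds).
client = typesense.Client({
    'nodes': [{'host': 'localhost', 'port': '8108', 'protocol': 'http'}],
    'api_key': 'xyz',
    'connection_timeout_seconds': 600,
})

BATCH_SIZE = 5000  # 5K-10K documents per import call, per Jason's advice

def import_documents(documents):
    """Illustrative helper: import a list of dicts in large batches."""
    for i in range(0, len(documents), BATCH_SIZE):
        batch = documents[i:i + BATCH_SIZE]
        results = client.collections['geo-objects'].documents.import_(
            batch, {'action': 'upsert'}
        )
        # The response contains one {'success': ...} entry per document.
        failed = [r for r in results if not r.get('success')]
        if failed:
            print(f'{len(failed)} documents failed in this batch')
```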
Timon 07:57 PM
OK, now with a batch size of 5K and a higher timeout tolerance I can index 70,000 documents, but then Typesense refuses requests and the logs show: I20211215 19:52:42.399240 257 batched_indexer.cpp:242] Running GC for aborted requests, req map size: 0
Jason 08:00 PM
Could you check the amount of free RAM you have and swap usage (should be close to zero)?
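(On a Linux host or inside WSL, one quick way to check both; these are standard commands, not something from the thread:)

```sh
free -h       # free RAM and swap usage
docker stats  # live memory usage per running container
```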
Timon 08:20 PM
I still have 2 GB of RAM left (maybe it is restricted by Windows). I first receive an EOF error and then the connection is closed by Typesense. The size of the data to index is roughly 1 GB.
Jason 08:20 PM
Are you running via Docker?
Timon 08:23 PM
Yes, via WSL.
Jason 08:24 PM
Ah, Docker limits RAM by default, which you'd have to manually increase via a config setting.
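(With Docker Desktop on Windows, the WSL 2 backend takes its memory limit from WSL itself. A sketch of raising it via %UserProfile%\.wslconfig, with an illustrative value; run `wsl --shutdown` afterwards so the change takes effect:)

```ini
[wsl2]
# RAM available to WSL 2, and therefore to Docker (value is illustrative)
memory=16GB
```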

Timon 09:04 PM
Now everything is running smoothly THANKS!
Jason 09:19 PM
Awesome! Is that the entire OSM dataset? 😃
Timon 09:26 PM
No haha, first I'll try openaddresses.io datasets since my server only has 32 GB of RAM. Then I'll extract Europe from the OSM planet and try to index it. With that we could estimate how much RAM is needed for the planet.

Dec 16, 2021 (23 months ago)
Timon 01:00 PM
94 million addresses (US South coverage) with their geopoints take around 10 GB of RAM.
Timon 01:10 PM
With OSM we also have more properties to index, such as categories, and OSM coverage of the US is a bit better than OpenAddresses, so it should take around 30 GB of RAM for that. I'll try to index the global OpenAddresses dataset and see how that works out. I think indexing the complete OSM planet is actually not viable for in-memory search engines, because it would require an immense amount of memory.
Timon 01:10 PM
I'll tag Kishore Nallan because he was interested in that topic as well.
Kishore Nallan 01:13 PM
Would you be able to post a small dataset here that you've indexed along with the schema used? I would love to see if we can support some form of optional flash storage.
Timon 01:14 PM
What is small? 😄 I can extract one city for you, which is a few MB.
Kishore Nallan 01:14 PM
And also the typical type of queries used. Querying OSM data comes up often enough in conversations that I'm wondering if there is low-hanging fruit to improve.
Kishore Nallan 01:15 PM
Want to primarily see if memory usage is dominated by geo data or string data.
Timon 01:16 PM
The schema is rather simple:
{
  "name": "geo-objects",
  "fields": [
    {
      "name": "address",
      "type": "string",
      "facet": false,
      "optional": false,
      "index": true
    },
    {
      "name": "geo_point",
      "type": "geopoint",
      "facet": false,
      "optional": false,
      "index": true
    }
  ],
  "default_sorting_field": ""
`}``
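(For completeness, creating this collection with the Python client is a one-liner; this sketch assumes the `client` object from earlier:)

```python
client.collections.create({
    'name': 'geo-objects',
    'fields': [
        {'name': 'address', 'type': 'string'},
        {'name': 'geo_point', 'type': 'geopoint'},
    ],
})
```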
Timon 01:23 PM
Here's an example; I transformed the OpenAddresses data to fit the schema.
Kishore Nallan 01:24 PM
Okay and what does a typical query look like?
Timon 01:38 PM
/collections/geo-objects/documents/search/?q=Highway 77&query_by=address&sort_by=geo_point(34.995200, -80.976930):asc
A classic query: search addresses near a point (e.g. the user's position).
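(The same query through the Python client would look roughly like this; `geo_point(lat, lng):asc` sorts matches by distance from the given point, nearest first:)

```python
results = client.collections['geo-objects'].documents.search({
    'q': 'Highway 77',
    'query_by': 'address',
    # sort by distance from the user's position, nearest first
    'sort_by': 'geo_point(34.995200, -80.976930):asc',
})
```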
Kishore Nallan 01:42 PM
Thank you for sharing these. I think at the end of the day it all boils down to latency vs. memory (I know, I know!). I often think about going down the rabbit hole of supporting on-disk storage as an option, but there are already plenty of other databases for that.

But I do wonder what percentage of the 10 GB of RAM is on account of the geo index. Since the geo index is fairly new, there are probably things we can do to optimize it if it takes a much larger portion than the text index.
Timon 01:59 PM
You are right, there are tools for that. But tbh, the advantage of Typesense is the easy setup and configuration. I'm currently evaluating many tools for my use case, and for me it is safe to say that the time to market with Typesense is a huge advantage over Elasticsearch etc. (even though Elasticsearch has more features and requires less RAM thanks to hot/cold storage).
Timon 02:01 PM
Let me index the same dataset without the geopoints, and we can see how much memory the geo index takes.
Kishore Nallan 02:01 PM
👍
Timon 02:57 PM
I'm currently at 1/6 of the dataset with 500 MB allocated. Seems like geopoints are really heavy on memory usage.
Kishore Nallan 02:58 PM
Yup, I think I found a potential way to optimize how we store geopoints. Might be able to get you a new build to test in the next few days. 🙂
Kishore Nallan 02:58 PM
^ Thanks for confirming, I had a strong suspicion.
Timon 02:59 PM
You mind sharing? Is it some optimization in how the library is configured, or a new technique for storing them?
Kishore Nallan 03:00 PM
I'm going to try packing the document IDs associated with entries in the main geo data structure. I'm currently using std::vector, which has a very high fixed overhead (on gcc, about 24 bytes), so it wastes a lot of memory when the data structure is sparse.
Kishore Nallan 03:01 PM
Either storing them as a raw array, or as a compressed sequence, will help a lot.
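(Not Typesense's actual code, but a sketch of the "compressed sequence" idea: sorted doc IDs can be delta-encoded into varints in one compact byte buffer, avoiding a per-entry std::vector with its roughly 24-byte fixed overhead.)

```python
def pack_ids(sorted_ids):
    """Delta + varint encode a sorted list of doc IDs into compact bytes."""
    out = bytearray()
    prev = 0
    for doc_id in sorted_ids:
        delta = doc_id - prev
        prev = doc_id
        # varint: 7 payload bits per byte, high bit set on continuation bytes
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)

def unpack_ids(buf):
    """Decode the buffer produced by pack_ids back into doc IDs."""
    ids, cur, shift, prev = [], 0, 0, 0
    for b in buf:
        cur |= (b & 0x7F) << shift
        if b & 0x80:          # continuation byte
            shift += 7
        else:                 # final byte of this varint
            prev += cur
            ids.append(prev)
            cur, shift = 0, 0
    return ids

assert unpack_ids(pack_ids([3, 70000, 70001])) == [3, 70000, 70001]
```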
Timon 03:05 PM
Sounds great. Maybe it is also possible to index only fixed-size rectangles that point to a file on disk. With that, you would only index the rectangle and the corresponding ID of the file; a geopoint will always belong to a rectangle.
Timon 03:06 PM
Then when filtering for documents within a certain geo radius, we can load all rectangles that intersect the radius and run the operations on the points loaded from disk.
Kishore Nallan 03:06 PM
I don't follow here: are you saying that by indexing rectangles instead of individual geo points, we can potentially reduce the total data indexed?
Kishore Nallan 03:07 PM
Ah okay got it.
Timon 03:07 PM
Yes, these are just some thoughts I have. But I haven't done any research on indexing geodata.
Kishore Nallan 03:07 PM
Two-layer indexing, where the larger "cells" can be used for quick in-memory filtering, while the finer level is done from disk.
Timon 03:07 PM
So when we have highly dense geodata, we only have to index the bounding boxes.
Timon 03:08 PM
Yes, exactly. Maybe there is a similar approach for that using R-trees.
Kishore Nallan 03:08 PM
Got it. Let's see how much I can squeeze out of the data structure packing change first, and then this can be taken up next.
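(A toy sketch of the two-layer idea being discussed: a coarse fixed-size grid kept in memory, with each cell pointing at its points on disk. The cell size here is made up; a real design would more likely use S2 cells or an R-tree, as Timon notes.)

```python
CELL_DEG = 0.5  # illustrative fixed cell size, in degrees

def cell_of(lat, lng):
    """Coarse in-memory grid cell that a geopoint falls into."""
    return (int(lat // CELL_DEG), int(lng // CELL_DEG))

def candidate_cells(lat, lng, radius_deg):
    """Cells intersecting the bounding box of a radius search.

    Only these cells' points need to be loaded from disk and
    distance-checked precisely."""
    lo = cell_of(lat - radius_deg, lng - radius_deg)
    hi = cell_of(lat + radius_deg, lng + radius_deg)
    return [(r, c)
            for r in range(lo[0], hi[0] + 1)
            for c in range(lo[1], hi[1] + 1)]
```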
Timon 03:08 PM
Perfect, ping me here when you have a build for that. I will be happy to test it!
Kishore Nallan 03:09 PM
Will do 🙏
Dec 30, 2021 (22 months ago)
Timon 12:22 PM
Kishore Nallan have you been working on this?
Kishore Nallan 12:56 PM
Yes, I managed to pack the IDs, but it only saved about ten percent of memory. So we'd have to support using an on-disk KV store if we really want to make a dent.
Dec 31, 2021 (22 months ago)
Timon 10:55 AM
Having a hybrid solution for this would be wonderful, as it also targets people who have huge datasets but not the memory to index it all. I am really looking forward to this!