#community-help

Slow, High CPU Write Operations After Collection Drop in Typesense

TLDR Himank reported an issue in Typesense where deleting and recreating a collection led to slow write operations and high CPU usage. Kishore Nallan suggested using an alias to avoid this issue. Extensive testing and debugging followed, with pboros contributing local reproductions. Kishore Nallan set out to implement a range delete and a full db compaction after deletion to resolve the issue.

Jul 29, 2022 (14 months ago)
pboros
12:58 PM
Seems to be rocksdb, which is matching the read calls
Kishore Nallan
12:59 PM
Ah, that's RocksDB. Wonder if a compaction is running. But I see nothing in the LOG file inside the db directory.
12:59
Kishore Nallan
12:59 PM
But it would kinda explain the whole issue, since it happens right after a delete only.
pboros
12:59 PM
I don't think it's enough data for a compaction taking this long
01:00
pboros
01:00 PM
root@3942c8bd03f4:/tmp# du -sh db/
77M    db/
01:00
pboros
01:00 PM
And also: no io
Kishore Nallan
01:00 PM
pread without io?
pboros
01:00 PM
I mean no significant io
01:01
pboros
01:01 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    10.00    0.00   15.00     0.00     0.10    13.33     0.01    0.33    0.00    0.33   0.67   1.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.65    0.00    1.00    0.13    0.00   87.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    11.88    0.00   11.88     0.00     0.09    16.00     0.01    0.33    0.00    0.33   0.75   0.89
Kishore Nallan
01:01 PM
Okay, thanks for all the dumps. If I'm not able to get any other clue, I am going to try replacing the dumb "delete every single document" RocksDB API calls with a range delete.
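(For reference, a minimal sketch of the change being discussed here, replacing per-document deletes with a single RocksDB range tombstone. The key-prefix layout is illustrative only, not Typesense's actual schema.)

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <string>

// Illustrative: "coll_prefix" stands in for whatever per-collection key
// prefix the store uses; the real key layout may differ.
void drop_collection_keys(rocksdb::DB* db, const std::string& coll_prefix) {
    std::string begin = coll_prefix;
    std::string end = coll_prefix + "\xff";   // exclusive upper bound just past the prefix
    rocksdb::WriteBatch batch;
    batch.DeleteRange(begin, end);            // one range tombstone instead of N point deletes
    rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
    // Real code should check `s`.
}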
pboros
01:02 PM
That feels like the right direction
01:04
pboros
01:04 PM
Interesting thing from /stats:
01:04
pboros
01:04 PM
{
  "delete_latency_ms": 0,
  "delete_requests_per_second": 0,
  "import_latency_ms": 0,
  "import_requests_per_second": 0,
  "latency_ms": {
    "GET /health": 0.0,
    "POST /collections/epnbiqxgrr/documents/": 470.34285714285716,
    "POST /collections/imkwwholpv/documents/": 651.0
  },
  "pending_write_batches": 1,
  "requests_per_second": {
    "GET /health": 0.1,
    "POST /collections/epnbiqxgrr/documents/": 3.5,
    "POST /collections/imkwwholpv/documents/": 0.5
  },
  "search_latency_ms": 0,
  "search_requests_per_second": 0,
  "total_requests_per_second": 4.1,
  "write_latency_ms": 492.925,
  "write_requests_per_second": 4.0
}
01:04
pboros
01:04 PM
This is while the issue is happening, so it's visible in write_latency_ms.
Kishore Nallan
01:05 PM
Locally I have pretty much isolated the actual latency to the condition variable wake-up, though. There is a ~1 second gap between notify() and wake-up, which is exactly the time taken to insert the document.
01:06
Kishore Nallan
01:06 PM
I20220729 18:33:31.825273 92155904 batched_indexer.cpp:67] batch indexer queue notify
I20220729 18:33:33.282867 87326720 batched_indexer.cpp:137] batch indexer queue awake

01:07
Kishore Nallan
01:07 PM
It is possible that this is a symptom rather than a cause; maybe RocksDB is doing something that's putting pressure on the scheduler.
01:07
Kishore Nallan
01:07 PM
But then again, cpu spikes ONLY during the actual write requests, not otherwise.
pboros
01:08 PM
Right
01:08
pboros
01:08 PM
As soon as there is no insert, there is no cpu burn
Kishore Nallan
01:09 PM
Let me implement the range delete
pboros
01:11 PM
Are the rocksdb stats available somewhere? I see that based on configuration it should dump every 10 minutes
Kishore Nallan
01:13 PM
There is a LOG file in the db directory
03:01
Kishore Nallan
03:01 PM
I've managed to fix this, I think. Range delete did not help (but I've made the change anyway since it is better). The gdb stack trace dump began with a call to rocksdb::DBIter::Next() in the batched indexer, so I figured db iteration is taking too long after dropping so many keys.

I did a full db compaction after the deletion of the keys and that seems to have helped. I initially tried range compaction on only the key range that was deleted, but that didn't help either. Right now, I'm doing a blind compaction as part of every collection drop, but will probably have to change it to happen only when a "large" collection is dropped. Not sure if there is another criterion (like some kind of key fragmentation) to use here.
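(As a sketch, the full-compaction workaround described above boils down to a call like the following; it is not the exact Typesense code.)

#include <rocksdb/db.h>

// Compact the whole key space after the bulk delete so that subsequent
// iterations no longer have to step over the freshly written tombstones.
void compact_after_collection_drop(rocksdb::DB* db) {
    rocksdb::CompactRangeOptions options;
    // nullptr begin/end means "compact the entire database".
    rocksdb::Status s = db->CompactRange(options, nullptr, nullptr);
    // Real code should check `s`.
}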
Ovais
03:03 PM
If that's the case there may be further tweaking of RocksDB needed
Kishore Nallan
03:06 PM
I'm frankly surprised by this behavior. Compaction should be handled by RocksDB if it is deemed necessary. But the stack trace is clear: too much CPU on iteration which causes the scheduler to delay waking up the indexer thread.
pboros
03:06 PM
Rocksdb is a complex beast
03:07
pboros
03:07 PM
Does typesense have a way to take rocksdb configuration?
Kishore Nallan
03:08 PM
Haven't had the need to expose those yet.
pboros
03:08 PM
I mean to have the user specify rocksdb parameters
Kishore Nallan
03:08 PM
> When deleting lots of rows, some of the SST files might be filled with tombstones and affected range scan performance. We extended RocksDB’s compaction to track long close-by tombstones. If a key range with high density of tombstones is detected, it immediately triggers another compaction in order to compact them away. This helped to reduce range scan performance skews caused by scanning excessed number of tombstones. In RocksDB, CompactOnDeletionCollectorFactory is the right class to do that.
This seems relevant.
Ovais
03:09 PM
A large number of deletes can cause this issue that you see in the scan. And this is not something that RocksDB default config is optimized for
Kishore Nallan
03:09 PM
No way to configure rocksdb via params yet
pboros
03:09 PM
If you expose, I am happy to help with tuning it/benchmarking typesense
Ovais
03:09 PM
There is a configuration option to trigger compaction automatically based on threshold of tombstones. I will try to find it
Himank
04:35 PM
Kishore, one more data point: in my local setup, when I profiled the slow thread, in a couple of cases I saw it busy iterating over RocksDB and spending around 90% CPU; I saw something similar in the prod thread as well. This means the batch indexer was spending the majority of its time iterating and just reading to fetch the request before even processing it.
Kishore Nallan
04:58 PM
Yes, this is the same stack trace that was caught earlier as well. It happens after the large deletion. RocksDB is probably forced to skip a lot of entries for every iteration. This went away when I issued a forced compaction.
Himank
04:59 PM
👍
Ovais
06:07 PM
Yup, with a large number of deletes the scan is going to be slow, as it has to skip over the tombstones. That is a known issue.
06:07
Ovais
06:07 PM
Kishore Nallan, triggering a compaction might have some other side effects as well.
Aug 01, 2022 (13 months ago)
Kishore Nallan
01:35 PM
Using NewCompactOnDeletionCollectorFactory as described here seems to be the most generic way to account for this: https://github.com/facebook/rocksdb/blob/7fd68b7c39c76b5741533ee2a4d7024f3b513f2d/include/rocksdb/utilities/table_properties_collectors.h#L85
Aug 02, 2022 (13 months ago)
Ovais
01:44 PM
Yup
01:44
Ovais
01:44 PM
When are you planning on releasing the fix?
Aug 03, 2022 (13 months ago)
Kishore Nallan
02:58 AM
I'll be testing this out this week and will have a patch available.
Aug 05, 2022 (13 months ago)
Kishore Nallan
07:28 AM
Setting NewCompactOnDeletionCollectorFactory does not help. New writes post deletion still take as long: I don't even see an improvement. It's not clear to me when this compaction will kick in. While I do see compaction-related logs while the deletion happens, it's not clear whether they are related to this, and in any case it doesn't help. So far the only thing that helps is an outright full DB compaction right after the deletion. Even compacting just the range of the deleted collection keys doesn't help.
Ovais
01:13 PM
I am actually surprised that DeleteRange did not help: https://github.com/facebook/rocksdb/wiki/DeleteRange
01:14
Ovais
01:14 PM
I am also surprised by this performance issue. Deleting ranges is a common operation that every system using RocksDB has to deal with.
Kishore Nallan
01:14 PM
Yes that's the first thing I tried:

a) Modified deletes to use DeleteRange
b) Tried compacting only the deleted range
c) Now tried this compaction factory

Nothing helped.
01:14
Kishore Nallan
01:14 PM
This seems to be a known issue; there have been a few posts.
01:15
Kishore Nallan
01:15 PM
01:18
Kishore Nallan
01:18 PM
One suggestion I'm yet to try is read_options.iterate_upper_bound which has been recommended. I will check that.
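(For context, a minimal sketch of the read_options.iterate_upper_bound suggestion; the key names here are placeholders.)

#include <rocksdb/db.h>
#include <memory>
#include <string>

// Bound the scan so Next() stops at `upper_key` instead of walking into the
// tombstone-heavy range that sorts after the live keys.
void scan_with_upper_bound(rocksdb::DB* db, const std::string& prefix,
                           const std::string& upper_key) {
    rocksdb::Slice upper(upper_key);            // must outlive the iterator
    rocksdb::ReadOptions read_options;
    read_options.iterate_upper_bound = &upper;

    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
    for (it->Seek(prefix); it->Valid(); it->Next()) {
        // process it->key() / it->value()
    }
}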
Ovais
01:20 PM
Both of those two issues suggest range delete.
01:20
Ovais
01:20 PM
What I mean is that, considering how common range deletes are, there are not many issues reported.
Kishore Nallan
01:21 PM
At least one other person has mentioned that it didn't help: https://github.com/facebook/rocksdb/issues/5265#issuecomment-488088104
Kishore Nallan
01:26 PM
Yes, that's the factory I mentioned earlier about trying. You have to set it as table options. This is what I did:

// window of 10,000 consecutive entries; mark a file for compaction when
// >= 7,500 of them are deletes, or when >= 50% of the file's entries are deletes
options.table_properties_collector_factories.emplace_back(
                rocksdb::NewCompactOnDeletionCollectorFactory(10000, 7500, 0.5));
01:28
Kishore Nallan
01:28 PM
Maybe those values are too aggressive and never triggered the compaction?
01:28
Kishore Nallan
01:28 PM
I will try with smaller values as well.
Ovais
01:28 PM
Yeah could be
01:32
Ovais
01:32 PM
Try tweaking the deletion_trigger
Aug 06, 2022 (13 months ago)
Kishore Nallan
10:11 AM
Finally, I got compactions running by setting options.level0_file_num_compaction_trigger = 1 (default: 4). Level 0 compaction did not trigger previously because there were not enough level-0 files to reach that default threshold. I think if there had been sustained writes after the deletion, it would have eventually happened. I still need to understand the implications of setting this value too low, but the impact of a large number of level-0 files is explained succinctly in this paragraph:

> RocksDB level 0 SST files are different from other levels. Memtables are flushed to disk as level 0 SST files. The data in a level 0 SST file are arranged in the order of generation as the memtable, and data from the same regions can be spread across different files. Also, data with the same key can be in multiple SST files. Therefore, when a read occurs, each level 0 SST file must be read in order. The more level 0 SST files, the more likely query read performance will be impacted negatively.
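(For reference, the tweak described above is a single option on the DB's Options; shown here only as a sketch.)

#include <rocksdb/options.h>

rocksdb::Options options;
// Default is 4; with 1, a compaction is triggered as soon as a single L0 file exists.
options.level0_file_num_compaction_trigger = 1;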
Ovais
01:02 PM
That's most likely too aggressive though. As soon as the write buffer is flushed to disk, it will trigger a compaction.
01:03
Ovais
01:03 PM
That note applies when you have many L0 files
Kishore Nallan
01:10 PM
Agreed. This is how the files and levels looked when compaction was not triggered:

20M 000906.sst
21M 000909.sst
10M 001039.sst
9.9M 001040.sst
4.0K 001105.sst

L0: 0
L1: 1 (2 KB)
L2: 4 (60 MB)
Ovais
01:12 PM
Ok there is no L0 file?
01:12
Ovais
01:12 PM
And these are not that many files
Kishore Nallan
01:14 PM
Yes, I got these from the LOG file, which matched the file sizes on disk.
Ovais
01:14 PM
Hmm
01:14
Ovais
01:14 PM
And how does it look after triggering compaction?
Kishore Nallan
01:15 PM
After compaction:

4.0K    001136.sst
 16K    001138.sst

Nothing was logged in the LOG file on this run. I'll probably have to use some rocksdb CLI tool or something to find out what levels these are on.
Ovais
01:16 PM
Oh wow. Everything got compacted.
Kishore Nallan
01:17 PM
Yes, Level 1 triggers Level 2 and so on. The max_bytes_for_level_base is the default of 256M, so I am not able to understand how anything would ever end up in L2, because the total size is ~60M even without compaction.
01:20
Kishore Nallan
01:20 PM
This probably needs someone who is well versed in RocksDB internals to comment on, and maybe having a stand-alone reproducible example will help elucidate these observations.
Ovais
01:23 PM
There is some useful info here about how to get stats https://github.com/EighteenZi/rocksdb_wiki/blob/master/RocksDB-Tuning-Guide.md
01:49
Ovais
01:49 PM
Let me look at how rocksdb is configured here
01:50
Ovais
01:50 PM
I think the write buffer size of 4MB is too small and will result in a large number of L0 files when there is enough write throughput.
Kishore Nallan
01:53 PM
Yes, we certainly need to expose these RocksDB configurations for tweaking.

I've another idea which I am now exploring. It's the .next() call that is taking time, not the initial iterator seek call. This is because the next call is not bounded, and according to another comment on RocksDB GH, setting an upper bound on the iterator will help. The deleted key range sorts after the seek's prefix, so without an upper bound the iterator may keep looking at all those tombstones that could otherwise be skipped (there's an open but abandoned PR to handle this scenario).
Ovais
01:58 PM
Hmm
01:59
Ovais
01:59 PM
That would be worth exploring
02:02
Ovais
02:02 PM
Another option that would be good to be enabled by default: level_compaction_dynamic_level_bytes=true
02:02
Ovais
02:02 PM
And probably change the write buffer size to 16M
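(Sketch of the two settings suggested above; the values come straight from this discussion and are not benchmarked recommendations.)

#include <rocksdb/options.h>

rocksdb::Options options;
options.level_compaction_dynamic_level_bytes = true;   // see the caveat about existing DBs below
options.write_buffer_size = 16 * 1024 * 1024;          // 16 MB memtable before flushing to L0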
Kishore Nallan
02:02 PM
The docs warn about enabling level_compaction_dynamic_level_bytes on existing DBs, so we need to plan how to do that.
02:03
Kishore Nallan
02:03 PM
(I mean not for this bug, but as a general default)
Ovais
02:08 PM
Yeah
02:18
Ovais
02:18 PM
Also not sure if bloom_filter is enabled
02:18
Ovais
02:18 PM
That's also worth enabling if not already enabled
Kishore Nallan
02:28 PM
What I'm going to do is extract this issue as a standalone cpp example so I can iterate faster on these tweaks. Will keep you posted.
Ovais
02:30 PM
Cool
Aug 08, 2022 (13 months ago)
Kishore Nallan
01:29 PM
I was able to reproduce this issue in a standalone cpp program. Adding an iterator upper_bound helps speed things up (since iteration can end even before the tombstones are encountered), but I ran into another bug/issue with the upper bound API. I will have to post on the RocksDB mailing list / SO later.
Ovais
06:04 PM
Cool, that's great progress
Aug 09, 2022 (13 months ago)
pboros
06:43 AM
Do you mind sharing the standalone program?
Kishore Nallan
06:47 AM
Here's the gist: https://gist.github.com/kishorenc/4ff763062f694665514d9d8ddcabae52

I've managed to fix the upper bound API issue I ran into earlier: it was because Slice is like a string_view and I was using a stack-allocated local. Apart from that, in the code above, you will notice that compaction does not kick in despite using that factory.
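(A small sketch of the Slice pitfall mentioned above, in an illustrative shape rather than the gist's actual code: the bytes backing iterate_upper_bound must outlive the iterator, so they can't live in a stack frame that ends first.)

#include <rocksdb/options.h>
#include <rocksdb/slice.h>
#include <string>

// Keep the bound's backing string next to the ReadOptions that point at it,
// e.g. as members of whatever object owns the iterator.
struct BoundedScanState {
    std::string upper_key;        // owns the bytes of the upper bound
    rocksdb::Slice upper_bound;   // non-owning view into upper_key (like string_view)
    rocksdb::ReadOptions read_options;

    explicit BoundedScanState(std::string key)
        : upper_key(std::move(key)), upper_bound(upper_key) {
        read_options.iterate_upper_bound = &upper_bound;
    }
    // Note: don't copy or move this object without re-pointing upper_bound.
};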
06:49
Kishore Nallan
06:49 AM
I'm now incorporating the change of setting an upper bound on the batch indexer iterator, which fixes the Next() stall caused by the delete tombstones. However, compaction not kicking in is still an issue which I need to raise with the rocksdb folks. Maybe my understanding of how that factory option works is not correct.
01:29
Kishore Nallan
01:29 PM
I've published the typesense/typesense:0.24.0.rc28 Docker image, which contains the iterator upper bound patch that has eliminated the post-bulk-delete write latencies for me locally.
Aug 10, 2022 (13 months ago)
Kishore Nallan
04:50 AM
The RocksDB folks also clarified NewCompactOnDeletionCollectorFactory -- it only works for single deletes and not for range deletes. If range delete is used, the factory is not needed. I've already switched collection deletion to range delete as well, so I think we should be good now.
Ovais
03:26 PM
Great.
pboros
03:30 PM
The fix works with our original use case, as well as with my tests.
Kishore Nallan
03:38 PM
Thank you for all the help in identifying and fixing this!
Ovais
04:06 PM
Thank you for prioritizing the issue and fixing it