# community-help
h
Hello folks, we were running some load tests, and what we have noticed is that once we drop a collection in search which has around ~4-5GB data around 10M documents, the next run of write operations(upserts) are very slow. Did you folks notice this as well or are any known limitations?
k
👋 Can you clarify what you mean by `drop`? Do you delete the collection, then recreate it and then repopulate it?
h
that’s right, so create a collection, load the data and then delete the collection and repeat.
k
Does it happen right after the first delete, or gradually over time as you keep doing this?
h
Usually, if the collection is around ~4-5GB or greater then we are seeing it in the next iteration.
k
Technically this shouldn't happen, because many users do use delete + re-create to repopulate a collection, with the help of an alias for switching over.
Have you tried using an alias and then creating the collection under a different name?
h
No, we didn't try that option, but thanks for the suggestion, I will take a look at it. Based on your experience, should this pattern be fine even for a collection of 10-20GB?
k
I have not explicitly benchmarked either of the approaches, but many users do use the alias approach. I also wonder if the slowdown happens because of the huge number of deletes on the original collection namespace in RocksDB, which might make subsequent inserts into it slow. However, this is just a hypothesis. If the alias approach works, then this is likely the reason.
h
That's my hunch as well, that something in RocksDB is triggering the slowdown. Let me see if it is easy to try the alias approach, and I will let you know how it goes.
👍 1
k
Regardless of the alias, the idea is to see if the issue happens when the writes happen on a different collection name, instead of reusing the collection name. Alias is just a convenience for not having to update client code if a new collection name has to be used.
👍 1
h
Sounds good.
Hey @Kishore Nallan, we tried again, and irrespective of the collection, the new inserts are very slow. The steps are: insert around 10M documents, drop the collection, and after that any new insert on any collection is very, very slow. But I am not sure how to debug it further: there are no error logs, and Raft seems to be doing fine. CPU is also stable.
actually CPU is high.
We have also opened an issue for this https://github.com/typesense/typesense/issues/679
k
I'll take a look.
h
thanks!
k
Does this whole test run inside an automated test suite or do you run it manually? Asking because I wonder if it might be related to https://github.com/typesense/typesense/issues/678
h
It is an automated test suite, where we have YCSB benchmarking running against a service that has Typesense integrated with it. We are trying to work on a standalone test case that can reproduce it by running it directly against Typesense.
But it is not local, which is what the other issue is about; our setup is in AWS and everything is running inside Docker.
k
I see. Does the cluster ever recover after that? Does the cpu usage reduce?
I have to run this against a large enough dataset to test but so far I could not reproduce this locally on a ~500M dataset.
h
CPU drops when there is no traffic, but as soon as we do write on any collection the CPU spikes and we see very low throughput.
k
I don't even notice a slight increase in import time after several runs.
h
and each write takes ~300-400 msec (this is when I am issuing curl directly against Typesense in a single thread)
k
You are using the `import` endpoint?
h
Yes, in our tests. But when I try issuing curl just to see if the latencies are normal, I just do this:
Copy code
for i in {1..100}; do curl "http://localhost:8108/collections/companies/documents?action=upsert" -d '{"company_name": "user6284786258713890", "num_employees": 1, "country": "ddddd"}'; done
k
Are you able to reproduce it on a smaller set? Maybe 1M docs, ~250 MB dataset size? That would be ideal for me to debug locally.
h
We are trying to see if we can reproduce it against a Typesense container running locally; let me get back to you on this reproducible set. Some other notes: it is a single thread that is hogging all the CPU when we start the inserts again. I am also suspecting the RocksDB scan (in the batch indexer) is making it slow, but I'm not sure.
but we didn’t see any high IO.
k
Okay more questions, bear with me 🙂 did you wait fully for the delete to finish? Delete can take some time for large collections.
h
Sure 🙂 So we saw high IO after the delete (for around 10 minutes); we didn't stop immediately, but we saw a lot of context deadline errors, which was expected. But the issue persisted even after 7-8 hours, even though there is no IO.
Even now, when I try to do insert, I see high CPU and latency.
k
Got it, I will wait for the local dataset reproduction result.
o
@pboros might be able to share some results from local testing.
What happens when I run it: 10 documents are inserted into the collection in around 50ms each; after that, I bulk upsert 10M documents into coll2.
From my testing, the delete() call on coll2 always results in an exception.
If I start the test right after that, the upsert response time in coll1 (after dropping that as well) increases from around 50ms to around 15 sec on my local machine.
k
The default client timeout is probably too small for the delete call to finish. You'll certainly have to bump up the default.
p
Will do that
Is there any bulk delete similar to bulk import?
k
We have to improve bulk delete performance.
I will be running this and checking locally. Did the issue happen even when you set batch_size to a lower value, say, 1000?
I mean, 10,000
p
I can check that for you, based on poking around it seems proportional with the data size
k
Btw, if you are not using a high timeout for the client, what's essentially happening is that you are now inserting DURING collection deletion.
Anyway enough questions, I will try it out and get back to you.
p
I will set the timeout to higher
Is there any way to tell when the deletion is finished?
k
Check /collections to see if the collection still exists. The documents are deleted first and the collection list is updated right at the end.
p
I set the timeout to 5 minutes, after around 2, I get the following exception
Copy code
2022-07-28 14:56:52.331753 imported a batch
2022-07-28 14:56:52.331795 starting to delete coll2
Traceback (most recent call last):
  File "/Users/pboros/ts_test.py", line 111, in <module>
    main()
  File "/Users/pboros/ts_test.py", line 86, in main
    client.collections['coll2'].delete()
  File "/Users/pboros/.virtualenvs/ts/lib/python3.9/site-packages/typesense/collection.py", line 25, in delete
    return self.api_call.delete(self._endpoint_path())
  File "/Users/pboros/.virtualenvs/ts/lib/python3.9/site-packages/typesense/api_call.py", line 156, in delete
    return self.make_request(requests.delete, endpoint, True,
  File "/Users/pboros/.virtualenvs/ts/lib/python3.9/site-packages/typesense/api_call.py", line 113, in make_request
    raise ApiCall.get_exception(r.status_code)(r.status_code, error_message)
typesense.exceptions.ObjectNotFound: [Errno 404] No collection with name `coll2` found.
Copy code
In [4]: [c['name'] for c in client.collections.retrieve()]
Out[4]: ['coll1']
But the collection is no longer there, so maybe this is just an issue with the python client
o
@pboros the other issue we saw is that inserts to any other collection while the delete is running get stalled as well, right?
p
@Ovais Tariq Right, any insert slows down if there was a deletion. I am adjusting the test case now to make sure that the collection is gone before proceeding, and see if the problem stays like that. I will update the gist once I have that.
o
I also see that you are doing scan and delete, is there a reason for not using delete range?
Copy code
while(iter->Valid() && iter->key().starts_with(del_override_prefix)) {
    store->remove(iter->key().ToString());
    iter->Next();
}
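For reference, a minimal sketch of the range-delete alternative being asked about here, written against the stock RocksDB C++ API rather than Typesense's store wrapper (the prefix name follows the snippet above; the key layout and the 0xff upper bound are illustrative assumptions):

```cpp
#include <rocksdb/db.h>
#include <string>

// Sketch: drop every key under a prefix with a single range tombstone,
// instead of iterating and issuing one delete per key.
rocksdb::Status delete_prefix(rocksdb::DB* db, const std::string& del_override_prefix) {
    // DeleteRange removes keys in [begin, end). Appending 0xff gives an
    // upper bound just past the prefix (assumes no live key continues with
    // a 0xff byte at that position).
    std::string begin = del_override_prefix;
    std::string end = del_override_prefix + "\xff";
    return db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                           rocksdb::Slice(begin), rocksdb::Slice(end));
}
```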
k
Just legacy code 😅
o
🙂
p
@Kishore Nallan I am updating the testing script, so the issue is a lot more visible there.
o
@Kishore Nallan the slowdown in subsequent inserts may be due to the scan and delete. It would be interesting to see the improvement when using DeleteRange.
p
I can confirm that the slowdown is there even if I wait until the collection name disappears, I will update the test script in the gist shortly
I can also confirm that new collections will remain fast
k
Thanks, I will check out the gist tomorrow. Feel free to post additional updates on this thread, meanwhile.
p
Will do thanks. I will also be available during your daytime, I am in EU if you have any questions.
👍 1
k
I'm able to reproduce the slow insertion into coll1 post deletion as found in the logs of the first run. I will look into what's happening. Btw, the `connection_timeout_seconds` configuration in the client init is wrong: it should be on the same level as `api_key` and not inside `nodes`. This might explain the other `A collection with name coll1 already exists.` error.
p
Thank you, I am going to look at that.
In the subsequent runs, it's interesting that batch inserts are degrading as well, but a lot less.
Instead of 3 minutes, it takes 5-ish. I guess it's because the overhead is per transaction, and the actual inserts take longer there, so the overall overhead percentage is less.
So the small inserts are 100 transactions, the batch load is 100 transactions as well, and the additional time added to the operation is about the same, around 1.5 min-ish.
k
After fixing the connection timeout, do you still get that "collection already exists" error?
p
I am running that test now, not yet, fingers crossed.
k
`coll1` is also hard coded here, so all small batch writes are going to only coll1 in this test: https://gist.github.com/pboros/8be8fa11f9f7a1e10d1a49bd46c1411e#file-ts_test-py-L57 Doesn't change anything much, just fyi.
p
Great catch, thank you!
I updated the gist with these if anyone else wants to take a look
With the timeout in the proper place, the "collection does not exist" error is no longer showing up.
It was intermittent, but by this time it would usually have shown itself.
k
Cool. I will keep you posted on what I find regarding the slowdown. What I know now is that the slowdown is not just related to the collection write endpoint. Even the following is slow:
Copy code
for i in `seq 1 100`; do curl "http://localhost:8108/health" -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"; done
All writes seem to be slow, which suggests some common bottleneck. Reads are fine (replace POST with GET in the command above).
p
After deleting the coll2, inserting into coll3 (a new collection) is not slow for me.
After fixing the small load so it's not hardcoded 🙂
What we saw is that there is no IO during this. As soon as the inserts are over, resource utilization goes to 0; firing even some inserts would peg a single CPU core with a single thread.
The instance is not able to recover from this even after 6 hours-ish, which should be enough for anything.
The next thing I would like to check if I can produce this at all with using random collection names every time.
k
Are you still seeing CPU being pegged now, after the change to the timeout?
p
Yes
Copy code
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.8 us,  0.9 sy,  0.0 ni, 87.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16395664 total, 13430728 free,   319760 used,  2645176 buff/cache
KiB Swap:  1048572 total,  1048572 free,        0 used. 15374388 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0 2759544  73112  21612 S 100.6  0.4   8:04.90 typesense-serve
  318 root      20   0   18240   3312   2840 S   0.0  0.0   0:00.02 bash
  328 root      20   0   36644   3052   2624 R   0.0  0.0   0:00.00 top
This is during the slow inserts
k
During `load_small`?
p
Yes
k
I'm not seeing any CPU peg, but I switched to testing on a local build (non-Docker) once I saw the slowdown. So Docker is the other variable here; I will get back to it once I get a handle on the slowness locally first.
p
Sounds great, I will see if I can reproduce it at all if every collection name is random. Let me know if I can help in any way.
Testing with random names shows something interesting. If I create a new collection after I did the large delete, that shows the regression, but the collection I created before does not. https://gist.github.com/pboros/0583b7e32e658479e4175bf4f3e87a1c If I create the 4th collection at the time of startup, that does not show the regression either. https://gist.github.com/pboros/47c09242bdd9052e2bd98d88897741d6 Seems like all collections created after the large delete have the issue, but let me test this a bit more to confirm it.
👍 1
k
When I run the program both on the native Mac binary and on the same 0.23.0 Docker image on Mac, while I see the slow writes, I don't see high CPU usage. Here's the result of `docker stats`:
Copy code
CONTAINER ID   NAME                    CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O        PIDS
07cb5d874739   dazzling_varahamihira   0.38%     141.9MiB / 3.841GiB   3.61%     516MB / 2.46MB   371MB / 27.9GB   231
p
How about inside the container? The high cpu usage is only for the duration of the slow, small inserts
k
But you said it persists for 6+ hours right?
p
I am also working on getting to know more: I created 100 collections, did the small load on them, created a big collection and dropped it, created 10 small collections afterwards: all is fine.
That's the slowness and not the CPU usage, CPU usage is only there during the inserts
If I create 100 collections, run the small load on 50, create the big one, delete it, then delete 50 small collections. After that, if I run the small load on the remaining 50 collections, they will be slow.
k
As for "slow" write I confirmed that the actual time taken to process the write inside the Typesense indexing engine is literally 0ms. The problem appears to be in the batch indexer, where a thread is woken up to process the write. I still don't know what's happening there but once large number of docs are indexed, this thread wake up takes 1 second to happen 🤔 The same issue does not happen if I index 1M docs. So I have to still figure that out, but persistent high CPU does not happen.
p
@Kishore Nallan We also tested that; without dropping involved, we could index 10s of millions of docs without triggering the slowdown.
So is there a sleep somewhere in the thread that is woken up? The CPU utilization could just be waiting on mutexes for that woken-up thread; how much of it is spin wait can depend on the OS/environment.
k
No, a condition variable is used, so it must be instant. Here's where the thread associated with the queue is notified: https://github.com/typesense/typesense/blob/main/src/batched_indexer.cpp#L67
It almost feels like the OS is running short of some resource, like maximum threads or something, so the mechanism of waking up changes at that point.
Two other things: 1. All POST endpoints (e.g. POST /health) which use the batch indexer suffer from the same problem, so this is unrelated to the underlying engine. 2. Collections created before/after the drop all suffer from the same problem. I am now going to investigate ulimit etc.
And yes, I do see high CPU usage during those slow inserts, but even hitting POST /health causes high cpu usage!
p
I tried to check ulimits, but did not see anything obvious there on my side.
Copy code
root@45652a6758b7:/# cat /proc/1/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63590                63590                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
k
All of a sudden I'm unable to reproduce it 🤷 Would it be possible for you to run `strace` during the high CPU usage and get a capture of it? On Mac/Docker strace does not seem to work. Otherwise I have to launch a Linux instance to try later.
p
Yes, we did that already; the thread that is eating up the CPU spends its time in pread64.
I will grab an strace for you
With `top -H` you can see which thread it is; in my case now it's 231:
Copy code
231 root      20   0 2751372  79668  21612 R 99.9  0.5   2:21.33 typesense-serve
    7 root      20   0 2751372  79668  21612 S  0.3  0.5   0:02.26 typesense-serve
One moment, I need more privileges in the container
I can share the one that is reproducing it with random collection names as well
k
Yeah that's fine, it should be the same issue.
The other thing we could do is to gdb attach to that thread to see what exactly that piece of code is. Pretty sure it's a batch indexer queue wake up thread, but not sure if we can confirm it based on symbols.
p
Will do that.
k
Another downright strange behavior I'm seeing now is that the curl version of the query is not slowing down, but the Python call hangs:
Copy code
# Does not slow down
for i in `seq 2 100`; do curl "http://localhost:8108/collections/coll1/documents?action=upsert" -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d "{\"id\": \"$i\", \"field1\": \"foo\", \"field2\": \"bar\"}"; done
This slows down and pegs CPU:
Copy code
client = get_client()
load_small(client, "coll1", 100)
p
Originally we had the issue with the Go driver; I wrote the example in Python because I saw that's an official one.
k
Scratch that, 2 different processes. I am going to take a break haha
p
🙂
With strace I do see the stalls, when it's in read calls I see things resembling the documents
Copy code
root@3942c8bd03f4:/# strace -p 232 2>&1 | tee -a strace.out
strace: Process 232 attached
futex(0x7fbc2c4ea0cc, FUTEX_WAIT_PRIVATE, 207, NULL
The thread that is eating up the CPU is actually changing; I managed to get some strace output from one of those.
Here is a part of that.
k
Thanks will check
p
Thread dump from the thread that was eating up the cpu: https://gist.github.com/pboros/cc8a93d4b9876abd7bdb5c81e080f2da
Seems to be RocksDB, which matches the read calls.
k
Ah, that's RocksDB. Wonder if a compaction is running. But I see nothing in the LOG file inside the `db` directory.
But it would kinda explain the whole issue, since it happens right after a delete only.
p
I don't think it's enough data for a compaction taking this long
Copy code
root@3942c8bd03f4:/tmp# du -sh db/
77M	db/
And also: no io
k
pread without io?
p
I mean no significant io
Copy code
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    10.00    0.00   15.00     0.00     0.10    13.33     0.01    0.33    0.00    0.33   0.67   1.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.65    0.00    1.00    0.13    0.00   87.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    11.88    0.00   11.88     0.00     0.09    16.00     0.01    0.33    0.00    0.33   0.75   0.89
k
Okay, thanks for all the dumps. If I'm not able to get any other clue, I am going to try replacing the dumb "delete every single document" RocksDB API calls with a range delete.
p
That feels like the right direction
Interesting thing from /stats:
Copy code
{
  "delete_latency_ms": 0,
  "delete_requests_per_second": 0,
  "import_latency_ms": 0,
  "import_requests_per_second": 0,
  "latency_ms": {
    "GET /health": 0,
    "POST /collections/epnbiqxgrr/documents/": 470.34285714285716,
    "POST /collections/imkwwholpv/documents/": 651
  },
  "pending_write_batches": 1,
  "requests_per_second": {
    "GET /health": 0.1,
    "POST /collections/epnbiqxgrr/documents/": 3.5,
    "POST /collections/imkwwholpv/documents/": 0.5
  },
  "search_latency_ms": 0,
  "search_requests_per_second": 0,
  "total_requests_per_second": 4.1,
  "write_latency_ms": 492.925,
  "write_requests_per_second": 4
}
While the thing is happening, so it's visible in write_latency_ms
k
Locally I have pretty much isolated the actual latency to the condition variable wake-up, though. There is a ~1 second difference between notify() and wake-up, which is exactly the time taken to insert the document.
Copy code
I20220729 18:33:31.825273 92155904 batched_indexer.cpp:67] batch indexer queue notify
I20220729 18:33:33.282867 87326720 batched_indexer.cpp:137] batch indexer queue awake
It is possible that this is a symptom rather than a cause; maybe RocksDB is doing something which is putting pressure on the scheduler.
But then again, CPU spikes ONLY during the actual write requests, not otherwise.
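For context, a self-contained sketch of the kind of measurement described above; this is not Typesense's batched indexer, just the same condition-variable notify/wake pattern with timestamps around it:

```cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Illustration only: measure the gap between notify_one() and the moment the
// waiting worker thread actually wakes up.
int main() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    std::chrono::steady_clock::time_point notified_at;

    std::thread worker([&] {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return ready; });
        auto woke_at = std::chrono::steady_clock::now();
        auto gap_ms = std::chrono::duration_cast<std::chrono::milliseconds>(woke_at - notified_at);
        std::cout << "wake-up delay: " << gap_ms.count() << " ms\n"; // ~0 ms on an idle scheduler
    });

    {
        std::lock_guard<std::mutex> lock(m);
        ready = true;
        notified_at = std::chrono::steady_clock::now();
    }
    cv.notify_one();
    worker.join();
}
```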
p
Right
As soon as there is no insert, there is no cpu burn
k
Let me implement the range delete
🙌 1
p
Are the rocksdb stats available somewhere? I see that based on configuration it should dump every 10 minutes
k
There is a LOG file in the db directory
I've managed to fix this, I think. Range delete did not help (but I've made the change anyway since it is better). The stack trace from the gdb dump began with a call to `rocksdb::DBIter::Next()` in the batched indexer, so I figured it is taking too long to do DB iterations after dropping so many keys. I did a full DB compaction after the deletion of the keys and that seems to have helped. I initially tried range compaction on only the key range that was deleted, but that didn't help either. Right now, I'm doing a blind compaction as part of every collection drop, but I probably have to change it to happen only when a "large" collection is dropped. Not sure if there is another criterion (like some kind of key fragmentation) to use here.
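For reference, a hedged sketch of the two manual compactions being compared here: the "blind" whole-DB compaction and the one bounded to the dropped collection's key range (the prefix bounds are illustrative, not Typesense's actual key layout):

```cpp
#include <rocksdb/db.h>
#include <string>

// Full ("blind") manual compaction: nullptr bounds compact the entire keyspace.
rocksdb::Status compact_whole_db(rocksdb::DB* db) {
    return db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
}

// Compaction restricted to the dropped collection's key range only
// (the variant reported above as not helping on its own).
rocksdb::Status compact_dropped_range(rocksdb::DB* db, const std::string& prefix) {
    std::string end_str = prefix + "\xff"; // illustrative upper bound just past the prefix
    rocksdb::Slice begin(prefix), end(end_str);
    return db->CompactRange(rocksdb::CompactRangeOptions(), &begin, &end);
}
```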
o
If that's the case there may be further tweaking of RocksDB needed
k
I'm frankly surprised by this behavior. Compaction should be handled by RocksDB if it is deemed necessary. But the stack trace is clear: too much CPU on iteration which causes the scheduler to delay waking up the indexer thread.
p
Rocksdb is a complex beast
Does typesense have a way to take rocksdb configuration?
k
Haven't had the need to expose those yet.
p
I mean to have the user specify rocksdb parameters
k
When deleting lots of rows, some of the SST files might be filled with tombstones and affected range scan performance. We extended RocksDB’s compaction to track long close-by tombstones. If a key range with high density of tombstones is detected, it immediately triggers another compaction in order to compact them away. This helped to reduce range scan performance skews caused by scanning excessed number of tombstones. In RocksDB, CompactOnDeletionCollectorFactory is the right class to do that.
This seems relevant.
o
A large number of deletes can cause this issue that you see in the scan. And this is not something that RocksDB default config is optimized for
k
No way to configure rocksdb via params yet
p
If you expose, I am happy to help with tuning it/benchmarking typesense
👍 1
o
There is a configuration option to trigger compaction automatically based on threshold of tombstones. I will try to find it
h
Kishore, one more data point: in my local setup, when I profiled the slow thread, in a couple of cases I saw the thread was busy iterating over RocksDB and spending around 90% CPU, and I saw something similar in the prod thread as well. This means the batch indexer was spending the majority of its time iterating and reading just to get the request, before even processing it.
k
Yes, this is the same stack trace that was caught earlier as well. It happens after the large deletion. RocksDB is probably forced to skip a lot of entries on every iteration. This went away when I issued a force compaction.
h
👍
o
Yup, with the large number of deletes the scan is going to be slow, as it has to skip over the tombstones. That is a known issue.
@Kishore Nallan triggering a compaction might have some other side effects as well
k
Using `NewCompactOnDeletionCollectorFactory` as described here seems to be the most generic way to account for this: https://github.com/facebook/rocksdb/blob/7fd68b7c39c76b5741533ee2a4d7024f3b513f2d/include/rocksdb/utilities/table_properties_collectors.h#L85
o
Yup
When are you planning on releasing the fix?
k
I'll be testing this out this week and will have a patch available.
👍 2
Setting `NewCompactOnDeletionCollectorFactory` does not help. New writes post deletion still take as long: I don't even see an improvement. It's not clear to me when this compaction will kick in. While I do see compaction-related logs while the deletion happens, it's not clear if that's related to this, and in any case it doesn't help. So far the only thing that helps is an outright full DB compaction right after the deletion. Even compacting the range of the deleted collection keys doesn't help.
o
I am actually surprised that DeleteRange did not help: https://github.com/facebook/rocksdb/wiki/DeleteRange
I am also surprised by this performance issue. Deleting ranges is a common operation that every system using RocksDB has to deal with.
k
Yes, that's the first thing I tried: a) modified deletes to use DeleteRange, b) tried compacting only the deleted range, c) now tried this compaction factory. Nothing helped.
This seems to be a known issue, there have been a few posts.
One suggestion I'm yet to try is `read_options.iterate_upper_bound`, which has been recommended. I will check that.
o
Both of those two issues suggest range delete.
What I mean is that, considering how common range deletes are, there are not many issues reported.
k
At least one other person has mentioned that it didn't help: https://github.com/facebook/rocksdb/issues/5265#issuecomment-488088104
k
Yes, that's the factory I mentioned trying earlier. You have to set it as a table option. This is what I did:
Copy code
options.table_properties_collector_factories.emplace_back(
                rocksdb::NewCompactOnDeletionCollectorFactory(10000, 7500, 0.5));
Maybe those values are too aggressive and never triggered the compaction?
I will try with smaller values as well.
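For readers following along, a hedged annotation of those three arguments, with the parameter roles taken from the table_properties_collectors.h header linked earlier (the threshold values are just the ones from the message above, not recommendations):

```cpp
#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

// Sketch: register the deletion-triggered compaction collector on the DB options.
rocksdb::Options options_with_deletion_collector() {
    rocksdb::Options options;
    options.table_properties_collector_factories.emplace_back(
        rocksdb::NewCompactOnDeletionCollectorFactory(
            /*sliding_window_size=*/10000, // consecutive entries examined per window
            /*deletion_trigger=*/7500,     // deletes inside one window that mark the file for compaction
            /*deletion_ratio=*/0.5));      // or: >= 50% of a file's entries are deletes (0 disables this check)
    return options;
}
```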
o
Yeah could be
Try tweaking the deletion_trigger
k
Finally, I got compactions running by setting `options.level0_file_num_compaction_trigger = 1` (default: 4). Level 0 compaction did not trigger previously because there were not enough level-0 files to reach that default value. I think if there were sustained writes after the deletion, it would have eventually happened. I still need to understand the implications of setting this value too low, but the impact of a large number of level-0 files is explained succinctly in this paragraph:
RocksDB level 0 SST files are different from other levels. Memtables are flushed to disk as level 0 SST files. The data in a level 0 SST file are arranged in the order of generation as the memtable, and data from the same regions can be spread across different files. Also, data with the same key can be in multiple SST files. Therefore, when a read occurs, each level 0 SST file must be read in order. The more level 0 SST files, the more likely query read performance will be impacted negatively.
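A minimal sketch of where that knob lives; the value 1 mirrors the experiment above, and whether it is a sane long-term default is exactly the open question:

```cpp
#include <rocksdb/options.h>

// Sketch: schedule an L0 -> L1 compaction as soon as a single level-0 SST
// file exists (the RocksDB default is 4 files).
rocksdb::Options low_l0_trigger_options() {
    rocksdb::Options options;
    options.level0_file_num_compaction_trigger = 1;
    return options;
}
```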
o
That's most likely too aggressive though. As soon as the write buffer is flushed to disk, it will trigger a compaction.
That note applies when you have many L0 files
k
Agree. This is how the files and levels looked when compaction is not triggered:
Copy code
20M 000906.sst
21M 000909.sst
10M 001039.sst
9.9M 001040.sst
4.0K 001105.sst

L0: 0
L1: 1 (2 KB)
L2: 4 (60 MB)
o
Ok there is no L0 file?
And these are not that many files
k
Yes, I got these from the `LOG` file, which matched the file sizes on disk.
o
Hmm
And how does it look after triggering compaction?
k
After compaction:
Copy code
4.0K	001136.sst
 16K	001138.sst
Nothing was logged in the LOG file on this run. I'll probably have to use some RocksDB CLI tool or something to find out what levels these are on.
o
Oh wow. Everything got compacted.
k
Yes, Level 1 triggers Level 2 and so on. The `max_bytes_for_level_base` default is 256M, so I am not able to understand how anything would ever end up in L2, because the total size is ~60M even without compaction.
Probably needs someone who is well versed in RocksDB internals to comment on this, and maybe having a stand-alone reproducible example will help elucidate these observations.
o
There is some useful info here about how to get stats https://github.com/EighteenZi/rocksdb_wiki/blob/master/RocksDB-Tuning-Guide.md
👍 1
Let me look at how rocksdb is configured here
I think the write buffer size of 4MB is too small and will result in a large number of L0 files when there is enough write throughput.
k
Yes, we certainly need to expose these RocksDB configurations for tweaking. I have another idea which I am now exploring. It's the `.next()` call which is taking time, and not the initial iterator `seek` call. This is because the next call is not bounded, and according to another comment on the RocksDB GitHub, setting an upper bound on the iterator will help. The deleted key range lies after the seek's prefix, so without an upper bound the iterator keeps wading through all those tombstones that could otherwise be skipped (there's an open but abandoned PR to handle this scenario).
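A hedged sketch of that idea, which is what the eventual patch is built around: bound the batch indexer's iteration so Next() stops at the end of the live prefix instead of wading into the freshly deleted keyspace. The key naming is illustrative, not Typesense's actual layout; note the Slice lifetime gotcha that comes up again further down:

```cpp
#include <rocksdb/db.h>
#include <memory>
#include <string>

// Sketch: iterate only the keys under `prefix`, with an explicit upper bound
// so the iterator never has to skip over tombstones that lie past it.
void iterate_prefix(rocksdb::DB* db, const std::string& prefix) {
    // The bound string must outlive the iterator: rocksdb::Slice only
    // references the memory (it behaves like a string_view).
    std::string upper_bound_str = prefix + "\xff";
    rocksdb::Slice upper_bound(upper_bound_str);

    rocksdb::ReadOptions read_opts;
    read_opts.iterate_upper_bound = &upper_bound;

    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_opts));
    for (it->Seek(prefix); it->Valid(); it->Next()) {
        // process it->key() / it->value(); Valid() turns false at the bound
    }
}
```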
o
Hmm
That would be worth exploring
Another option that would be good to be enabled by default: level_compaction_dynamic_level_bytes=true
And probably change the write buffer size to 16M
k
The docs warn about enabling `level_compaction_dynamic_level_bytes` on existing DBs, so we need to plan how to do that.
(I mean not for this bug, but as a general default)
o
Yeah
Also not sure if bloom_filter is enabled
That's also worth enabling if not already enabled
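Pulling together the knobs suggested in the last few messages, a hedged sketch of what that tuning could look like (values are the ones mentioned above, not tested recommendations, and the dynamic-level-bytes caveat for existing DBs still applies):

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch of the suggested defaults; verify against the RocksDB version in use.
rocksdb::Options tuned_options() {
    rocksdb::Options options;
    options.write_buffer_size = 16 * 1024 * 1024;        // 16 MB memtable instead of 4 MB
    options.level_compaction_dynamic_level_bytes = true; // dynamic level sizing

    rocksdb::BlockBasedTableOptions table_options;
    table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10)); // bloom filter for point lookups
    options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
    return options;
}
```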
k
What I'm going to do is extract this issue as a standalone cpp example so I can iterate faster on these tweaks. Will keep you posted.
👍 1
o
Cool
k
I was able to reproduce this issue in a standalone cpp program. Adding an iterator upper_bound helps speed things up (since iteration can end even before the tombstones are encountered), but I ran into another bug/issue with the upper bound API. I will have to post on the RocksDB mailing list / SO later.
o
Cool, that's great progress
p
Do you mind sharing the standalone program?
k
Here's the gist: https://gist.github.com/kishorenc/4ff763062f694665514d9d8ddcabae52 I've managed to fix the upper bound API issue I ran into earlier: it's because Slice is like a string_view and I was using a stack-allocated local. Apart from that, in the code above, you will notice that compaction does not kick in despite using that factory.
I'm now incorporating the change of setting an upper bound on the batch indexer iterator, which fixes the Next() stall caused by the deleted tombstones. However, compaction not kicking in is still an issue, which I need to raise with the RocksDB folks. Maybe my understanding of how that factory option works is not correct.
I've published the `typesense/typesense:0.24.0.rc28` Docker image, which contains the iterator upper bound patch that has eliminated the post-bulk-delete write latencies for me locally.
🎉 3
🙌 3
The RocksDB folks also clarified about `NewCompactOnDeletionCollectorFactory`: it only works for single deletes and not for range deletes. If range delete is used, then the factory is redundant. I've already switched the collection deletion to range delete as well, so I think we should be good now.
👍 1
o
Great.
p
The fix works with our original use case, as well as with my tests.
🙌 1
k
Thank you for all the help in identifying and fixing this!
o
Thank you for prioritizing the issue and fixing it