# community-help
h
Hello folks, we were running some load tests, and what we have noticed is that once we drop a collection in search which has around ~4-5GB data around 10M documents, the next run of write operations(upserts) are very slow. Did you folks notice this as well or are any known limitations?
k
👋 Can you clarify what you mean by `drop`? Do you delete the collection, then recreate it and then repopulate it?
h
that’s right, so create a collection, load the data and then delete the collection and repeat.
k
Does it happen right after the first delete, or gradually over time as you keep doing this?
h
Usually, if the collection is around ~4-5GB or greater then we are seeing it in the next iteration.
k
Technically this shouldn't happen, because many users do use delete + re-create to repopulate a collection, with the help of an alias for switching over.
Have you tried using an alias and then creating the collection under a different name?
h
No, we didn't try that option, but thanks for the suggestion, I will take a look at it. Based on your experience, should this pattern be fine even for a collection of 10-20GB?
k
I have not explicitly benchmarked either of the approaches, but many users do use the alias approach. I also wonder if the slowdown happens because of the huge number of deletes on the original collection namespace in RocksDB, which might make subsequent inserts into it slow. However, this is just a hypothesis. If the alias approach works, then this is likely the reason.
h
That's my hunch as well, that something in RocksDB is triggering the slowdown. Let me see if it is easy to try the alias approach, and I will let you know how it goes.
👍 1
k
Regardless of the alias, the idea is to see if the issue happens when the writes happen on a different collection name, instead of reusing the collection name. Alias is just a convenience for not having to update client code if a new collection name has to be used.
👍 1
h
Sounds good.
Hey @Kishore Nallan, we tried again, and irrespective of the collection, the new inserts are very slow. The steps are: insert around 10M documents, drop the collection, and after that any new insert on any collection is very, very slow. But I am not sure how to debug it further: there are no error logs, and Raft seems to be doing fine. CPU is also stable.
actually CPU is high.
We have also opened an issue for this https://github.com/typesense/typesense/issues/679
k
I'll take a look.
h
thanks!
k
Does this whole test run inside an automated test suite or do you run it manually? Asking because I wonder if it might be related to https://github.com/typesense/typesense/issues/678
h
It is an automated test suite, where we have YCSB benchmarking running against a service that has Typesense integrated with it. We are trying to work on a standalone test case that can reproduce it by running it directly against Typesense.
But it is not local, which is what the other issue is about; our setup is in AWS and everything is running inside Docker.
k
I see. Does the cluster ever recover after that? Does the cpu usage reduce?
I have to run this against a large enough dataset to test but so far I could not reproduce this locally on a ~500M dataset.
h
CPU drops when there is no traffic, but as soon as we do write on any collection the CPU spikes and we see very low throughput.
k
I don't even notice a slight increase in import time after several runs.
h
and each write takes ~300-400 msec (this is when I am issuing curl directly against Typesense in a single thread)
k
You are using the `import` endpoint?
h
Yes, in our tests. But when I try issuing curl just to see if the latencies are normal, I just do this:
Copy code
for i in {1..100}; do curl "http://localhost:8108/collections/companies/documents?action=upsert" -d '{"company_name": "user6284786258713890", "num_employees": 1, "country": "ddddd"}'; done
k
Are you able to reproduce it on a smaller set? Maybe 1M docs, ~250 MB dataset size? That would be ideal for me to debug locally.
h
We are trying to see if we can reproduce it against a Typesense container running locally; let me get back to you on this reproducible set. Some other notes: it is a single thread that is hogging all the CPU when we start the inserts again. I am also suspecting the RocksDB scan (in the batch indexer) is making it slow, but I'm not sure.
but we didn’t see any high IO.
k
Okay more questions, bear with me 🙂 did you wait fully for the delete to finish? Delete can take some time for large collections.
h
Sure 🙂 So we saw high IO after the delete (for around 10 minutes); we didn't stop immediately, but we saw a lot of context deadline errors, which was expected. But the issue persisted even after 7-8 hours, even though there is no IO.
Even now, when I try to do insert, I see high CPU and latency.
k
Got it, I will wait for the local dataset reproduction result.
o
@pboros might be able to share some results from local testing.
What happens when I run it: 10 documents are inserted into the collection in around 50ms each; after that, I bulk upsert 10M documents into coll2.
From my testing, the delete() call on coll2 always results in an exception.
If I start the test right after that, the upsert response time in coll1 (after dropping that as well) increases from around 50ms to around 15 sec on my local machine.
k
The default client timeout is probably too small for the delete call to finish. You'll certainly have to bump up the default.
p
Will do that
Is there any bulk delete similar to bulk import?
k
We have to improve bulk delete performance.
I will be running this and checking locally. Did the issue happen even when you set batch_size to a lower value, say, 1000?
I mean, 10,000
p
I can check that for you, based on poking around it seems proportional with the data size
k
Btw, if you are not using a high timeout for the client, what's essentially happening is that you are now inserting DURING collection deletion.
Anyway enough questions, I will try it out and get back to you.
p
I will set the timeout to higher
Is there any way to tell when the deletion is finished?
k
Check /collections to see if the collection still exists. The documents are deleted first and the collection list is updated right at the end.
p
I set the timeout to 5 minutes, after around 2, I get the following exception
Copy code
2022-07-28 14:56:52.331753 imported a batch
2022-07-28 14:56:52.331795 starting to delete coll2
Traceback (most recent call last):
  File "/Users/pboros/ts_test.py", line 111, in <module>
    main()
  File "/Users/pboros/ts_test.py", line 86, in main
    client.collections['coll2'].delete()
  File "/Users/pboros/.virtualenvs/ts/lib/python3.9/site-packages/typesense/collection.py", line 25, in delete
    return self.api_call.delete(self._endpoint_path())
  File "/Users/pboros/.virtualenvs/ts/lib/python3.9/site-packages/typesense/api_call.py", line 156, in delete
    return self.make_request(requests.delete, endpoint, True,
  File "/Users/pboros/.virtualenvs/ts/lib/python3.9/site-packages/typesense/api_call.py", line 113, in make_request
    raise ApiCall.get_exception(r.status_code)(r.status_code, error_message)
typesense.exceptions.ObjectNotFound: [Errno 404] No collection with name `coll2` found.
Copy code
In [4]: [c['name'] for c in client.collections.retrieve()]
Out[4]: ['coll1']
But the collection is no longer there, so maybe this is just an issue with the python client
o
@pboros the other issue we saw is that inserts to any other collection while the delete is running get stalled as well, right?
p
@Ovais Tariq Right, any insert slows down if there was a deletion. I am adjusting the test case now to make sure that the collection is gone before proceeding, and see if the problem stays like that. I will update the gist once I have that.
o
I also see that you are doing scan and delete, is there a reason for not using delete range?
Copy code
while(iter->Valid() && iter->key().starts_with(del_override_prefix)) {
    store->remove(iter->key().ToString());
    iter->Next();
}
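For reference, a minimal sketch of the range-delete alternative being asked about here, written against the stock RocksDB C++ API rather than Typesense's store wrapper (the prefix name follows the snippet above; the key layout and the 0xff upper bound are illustrative assumptions):

```cpp
#include <rocksdb/db.h>
#include <string>

// Sketch: drop every key under a prefix with a single range tombstone,
// instead of iterating and issuing one delete per key.
rocksdb::Status delete_prefix(rocksdb::DB* db, const std::string& del_override_prefix) {
    // DeleteRange removes keys in [begin, end). Appending 0xff gives an
    // upper bound just past the prefix (assumes no live key continues with
    // a 0xff byte at that position).
    std::string begin = del_override_prefix;
    std::string end = del_override_prefix + "\xff";
    return db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                           rocksdb::Slice(begin), rocksdb::Slice(end));
}
```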
k
Just legacy code 😅
o
🙂
p
@Kishore Nallan I am updating the testing script, so the issue is a lot more visible there.
o
@Kishore Nallan the slowdown in subsequent inserts may be due to the scan and delete. It would be interesting to see the improvement when using DeleteRange.
p
I can confirm that the slowdown is there even if I wait until the collection name disappears, I will update the test script in the gist shortly
I can also confirm that new collections will remain fast
k
Thanks, I will check out the gist tomorrow. Feel free to post additional updates on this thread, meanwhile.
p
Will do thanks. I will also be available during your daytime, I am in EU if you have any questions.
👍 1
k
I'm able to reproduce the slow insertion into coll1 post deletion as found in the logs of the first run. I will look into what's happening. Btw, the `connection_timeout_seconds` configuration in the client init is wrong: it should be on the same level as `api_key` and not inside `nodes`. This might explain the other `A collection with name coll1 already exists.` error.
p
Thank you, I am going to look at that.
In the subsequent runs, it's interesting that batch inserts are degrading as well, but a lot less.
Instead of 3 minutes, it takes 5-ish. I guess it's because the overhead is per transaction, and the actual inserts take longer there, so the overall overhead percentage is less.
So the small inserts are 100 transactions, the batch load is 100 transactions as well, and the additional time added to the operation is about the same, around 1.5 min-ish.
k
After fixing the connection timeout, do you still get that "collection already exists" error?
p
I am running that test now, not yet, fingers crossed.
k
`coll1` is also hard coded here, so all small batch writes are going to only coll1 in this test: https://gist.github.com/pboros/8be8fa11f9f7a1e10d1a49bd46c1411e#file-ts_test-py-L57 Doesn't change anything much, just fyi.
p
Great catch, thank you!
I updated the gist with these if anyone else wants to take a look
With the timeout in the proper place, the "collection does not exist" error is no longer showing up.
It was intermittent, but by this time it would usually have shown itself.
k
Cool. I will keep you posted on what I find regarding the slowdown. What I know now is that the slowdown is not just related to the collection write endpoint. Even the following is slow:
Copy code
for i in `seq 1 100`; do curl "http://localhost:8108/health" -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"; done
All writes seem to be slow, which suggests some common bottleneck. Reads are fine (replace POST with GET in the command above).
p
After deleting the coll2, inserting into coll3 (a new collection) is not slow for me.
After fixing the small load so it's not hardcoded 🙂
What we saw is that there is no IO during this. As soon as the inserts are over, resource utilization goes to 0; firing even some inserts would peg a single CPU core with a single thread.
The instance is not able to recover from this even after 6 hours-ish, which should be enough for anything.
The next thing I would like to check if I can produce this at all with using random collection names every time.
k
Are you still seeing CPU being pegged now, after the change to the timeout?
p
Yes
Copy code
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 11.8 us,  0.9 sy,  0.0 ni, 87.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16395664 total, 13430728 free,   319760 used,  2645176 buff/cache
KiB Swap:  1048572 total,  1048572 free,        0 used. 15374388 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0 2759544  73112  21612 S 100.6  0.4   8:04.90 typesense-serve
  318 root      20   0   18240   3312   2840 S   0.0  0.0   0:00.02 bash
  328 root      20   0   36644   3052   2624 R   0.0  0.0   0:00.00 top
This is during the slow inserts
k
During `load_small`?
p
Yes
k
I'm not seeing any CPU peg, but I switched to testing on a local build (non-Docker) once I saw the slowdown. So Docker is the other variable here; I will get back to it once I get a handle on the slowness locally first.
p
Sounds great, I will see if I can reproduce it at all if every collection name is random. Let me know if I can help in any way.
Testing with random names shows something interesting. If I create a new collection after I did the large delete, that shows the regression, but the collection I created before does not. https://gist.github.com/pboros/0583b7e32e658479e4175bf4f3e87a1c If I create the 4th collection at the time of startup, that does not show the regression either. https://gist.github.com/pboros/47c09242bdd9052e2bd98d88897741d6 Seems like all collections created after the large delete have the issue, but let me test this a bit more to confirm it.
👍 1
k
When I run the program both on the native Mac binary and on the same 0.23.0 Docker image on Mac, while I see the slow writes, I don't see high CPU usage. Here's the result of `docker stats`:
Copy code
CONTAINER ID   NAME                    CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O        PIDS
07cb5d874739   dazzling_varahamihira   0.38%     141.9MiB / 3.841GiB   3.61%     516MB / 2.46MB   371MB / 27.9GB   231
p
How about inside the container? The high cpu usage is only for the duration of the slow, small inserts
k
But you said it persists for 6+ hours right?
p
I am also working on getting to know more: I created 100 collections, did the small load on them, created a big collection and dropped it, created 10 small collections afterwards: all is fine.
That's the slowness and not the CPU usage, CPU usage is only there during the inserts
If I create 100 collections, run the small load on 50, create the big one, delete it, then delete 50 small collections. After that, if I run the small load on the remaining 50 collections, they will be slow.
k
As for "slow" write I confirmed that the actual time taken to process the write inside the Typesense indexing engine is literally 0ms. The problem appears to be in the batch indexer, where a thread is woken up to process the write. I still don't know what's happening there but once large number of docs are indexed, this thread wake up takes 1 second to happen 🤔 The same issue does not happen if I index 1M docs. So I have to still figure that out, but persistent high CPU does not happen.
p
@Kishore Nallan We also tested that; without dropping involved, we could index 10s of millions of docs without triggering the slowdown.
So is there a sleep somewhere in the thread that is woken up? The CPU utilization could just be waiting on mutexes for that woken-up thread; how much of it is spin wait can depend on the OS/environment.
k
No, a condition variable is used, so it must be instant. Here's where the thread associated with the queue is notified: https://github.com/typesense/typesense/blob/main/src/batched_indexer.cpp#L67
It almost feels like the OS is running short of some resource, like maximum threads or something, so the mechanism of waking up changes at that point.
Two other things: 1. All POST endpoints (e.g. POST /health) which use the batch indexer suffer from the same problem, so this is unrelated to the underlying engine. 2. Collections created before/after the drop all suffer from the same problem. I am now going to investigate ulimit etc.
And yes, I do see high CPU usage during those slow inserts, but even hitting POST /health causes high cpu usage!
p
I tried to check ulimits, but did not see anything obvious there on my side.
Copy code
root@45652a6758b7:/# cat /proc/1/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63590                63590                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
k
All of a sudden I'm unable to reproduce it 🤷 Would it be possible for you to run `strace` during the high CPU usage and get a capture of it? On Mac/Docker strace does not seem to work. Otherwise I have to launch a Linux instance to try later.
p
Yes, we did that already; the thread that is eating up the CPU spends its time in pread64.
I will grab an strace for you
With `top -H` you can see which thread it is; in my case now it's 231:
Copy code
231 root      20   0 2751372  79668  21612 R 99.9  0.5   2:21.33 typesense-serve
    7 root      20   0 2751372  79668  21612 S  0.3  0.5   0:02.26 typesense-serve
One moment, I need more privileges in the container
I can share the one that is reproducing it with random collection names as well
k
Yeah that's fine, it should be the same issue.
The other thing we could do is to gdb attach to that thread to see what exactly that piece of code is. Pretty sure it's a batch indexer queue wake up thread, but not sure if we can confirm it based on symbols.
p
Will do that.
k
Another downright strange behavior I'm seeing now is that the curl version of the query is not slowing down, but the Python call hangs:
Copy code
# Does not slow down
for i in `seq 2 100`; do curl "http://localhost:8108/collections/coll1/documents?action=upsert" -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d "{\"id\": \"$i\", \"field1\": \"foo\", \"field2\": \"bar\"}"; done
This slows down and pegs CPU:
Copy code
client = get_client()
load_small(client, "coll1", 100)
p
Originally we had the issue with the Go driver; I wrote the example in Python because I saw that's an official one.
k
Scratch that, 2 different processes. I am going to take a break haha
p
🙂
With strace I do see the stalls, when it's in read calls I see things resembling the documents
Copy code
root@3942c8bd03f4:/# strace -p 232 2>&1 | tee -a strace.out
strace: Process 232 attached
futex(0x7fbc2c4ea0cc, FUTEX_WAIT_PRIVATE, 207, NULL
The thread that is eating up the CPU is actually changing; I managed to get some strace output from one of those.
Here is a part of that.
k
Thanks will check
p
Thread dump from the thread that was eating up the cpu: https://gist.github.com/pboros/cc8a93d4b9876abd7bdb5c81e080f2da
Seems to be RocksDB, which matches the read calls.
k
Ah, that's RocksDB. Wonder if a compaction is running. But I see nothing in the LOG file inside the `db` directory.
But it would kinda explain the whole issue, since it happens right after a delete only.
p
I don't think it's enough data for a compaction taking this long
Copy code
root@3942c8bd03f4:/tmp# du -sh db/
77M	db/
And also: no io
k
pread without io?
p
I mean no significant io
Copy code
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    10.00    0.00   15.00     0.00     0.10    13.33     0.01    0.33    0.00    0.33   0.67   1.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.65    0.00    1.00    0.13    0.00   87.22

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00    11.88    0.00   11.88     0.00     0.09    16.00     0.01    0.33    0.00    0.33   0.75   0.89
k
Okay, thanks for all the dumps. If I'm not able to get any other clue, I am going to try replacing the dumb "delete every single document" RocksDB API calls with a range delete.
p
That feels like the right direction
Interesting thing from /stats:
Copy code
{
  "delete_latency_ms": 0,
  "delete_requests_per_second": 0,
  "import_latency_ms": 0,
  "import_requests_per_second": 0,
  "latency_ms": {
    "GET /health": 0,
    "POST /collections/epnbiqxgrr/documents/": 470.34285714285716,
    "POST /collections/imkwwholpv/documents/": 651
  },
  "pending_write_batches": 1,
  "requests_per_second": {
    "GET /health": 0.1,
    "POST /collections/epnbiqxgrr/documents/": 3.5,
    "POST /collections/imkwwholpv/documents/": 0.5
  },
  "search_latency_ms": 0,
  "search_requests_per_second": 0,
  "total_requests_per_second": 4.1,
  "write_latency_ms": 492.925,
  "write_requests_per_second": 4
}
While the thing is happening, so it's visible in write_latency_ms
k
Locally I have pretty much isolated the actual latency to the condition variable wake-up, though. There is a ~1 second difference between notify() and wake-up, which is exactly the time taken to insert the document.
Copy code
I20220729 18:33:31.825273 92155904 batched_indexer.cpp:67] batch indexer queue notify
I20220729 18:33:33.282867 87326720 batched_indexer.cpp:137] batch indexer queue awake
It is possible that this is a symptom rather than a cause; maybe RocksDB is doing something which is putting pressure on the scheduler.
But then again, CPU spikes ONLY during the actual write requests, not otherwise.
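For context, a self-contained sketch of the kind of measurement described above; this is not Typesense's batched indexer, just the same condition-variable notify/wake pattern with timestamps around it:

```cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

// Illustration only: measure the gap between notify_one() and the moment the
// waiting worker thread actually wakes up.
int main() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    std::chrono::steady_clock::time_point notified_at;

    std::thread worker([&] {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return ready; });
        auto woke_at = std::chrono::steady_clock::now();
        auto gap_ms = std::chrono::duration_cast<std::chrono::milliseconds>(woke_at - notified_at);
        std::cout << "wake-up delay: " << gap_ms.count() << " ms\n"; // ~0 ms on an idle scheduler
    });

    {
        std::lock_guard<std::mutex> lock(m);
        ready = true;
        notified_at = std::chrono::steady_clock::now();
    }
    cv.notify_one();
    worker.join();
}
```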
p
Right
As soon as there is no insert, there is no cpu burn
k
Let me implement the range delete
🙌 1
p
Are the rocksdb stats available somewhere? I see that based on configuration it should dump every 10 minutes
k
There is a LOG file in the db directory
I've managed to fix this, I think. Range delete did not help (but I've made the change anyway since it is better). The stack trace from the gdb dump began with a call to `rocksdb::DBIter::Next()` in the batched indexer, so I figured it is taking too long to do DB iterations after dropping so many keys. I did a full DB compaction after the deletion of the keys and that seems to have helped. I initially tried range compaction on only the key range that was deleted, but that didn't help either. Right now, I'm doing a blind compaction as part of every collection drop, but I probably have to change it to happen only when a "large" collection is dropped. Not sure if there is another criterion (like some kind of key fragmentation) to use here.
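For reference, a hedged sketch of the two manual compactions being compared here: the "blind" whole-DB compaction and the one bounded to the dropped collection's key range (the prefix bounds are illustrative, not Typesense's actual key layout):

```cpp
#include <rocksdb/db.h>
#include <string>

// Full ("blind") manual compaction: nullptr bounds compact the entire keyspace.
rocksdb::Status compact_whole_db(rocksdb::DB* db) {
    return db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
}

// Compaction restricted to the dropped collection's key range only
// (the variant reported above as not helping on its own).
rocksdb::Status compact_dropped_range(rocksdb::DB* db, const std::string& prefix) {
    std::string end_str = prefix + "\xff"; // illustrative upper bound just past the prefix
    rocksdb::Slice begin(prefix), end(end_str);
    return db->CompactRange(rocksdb::CompactRangeOptions(), &begin, &end);
}
```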
o
If that's the case there may be further tweaking of RocksDB needed
k
I'm frankly surprised by this behavior. Compaction should be handled by RocksDB if it is deemed necessary. But the stack trace is clear: too much CPU on iteration which causes the scheduler to delay waking up the indexer thread.
p
Rocksdb is a complex beast
Does typesense have a way to take rocksdb configuration?
k
Haven't had the need to expose those yet.
p
I mean to have the user specify rocksdb parameters
k
When deleting lots of rows, some of the SST files might be filled with tombstones and affected range scan performance. We extended RocksDB’s compaction to track long close-by tombstones. If a key range with high density of tombstones is detected, it immediately triggers another compaction in order to compact them away. This helped to reduce range scan performance skews caused by scanning excessed number of tombstones. In RocksDB, CompactOnDeletionCollectorFactory is the right class to do that.
This seems relevant.
o
A large number of deletes can cause this issue that you see in the scan. And this is not something that RocksDB default config is optimized for
k
No way to configure rocksdb via params yet
p
If you expose, I am happy to help with tuning it/benchmarking typesense
👍 1
o
There is a configuration option to trigger compaction automatically based on threshold of tombstones. I will try to find it
h
Kishore, one more data point: in my local setup, when I profiled the slow thread, in a couple of cases I saw the thread was busy iterating over RocksDB and spending around 90% CPU, and I saw something similar in the prod thread as well. This means the batch indexer was spending the majority of its time iterating and reading just to get the request, before even processing it.
k
Yes, this is the same stack trace that was caught earlier as well. It happens after the large deletion. RocksDB is probably forced to skip a lot of entries on every iteration. This went away when I issued a force compaction.
h
👍
o
Yup, with the large number of deletes the scan is going to be slow, as it has to skip over the tombstones. That is a known issue.
@Kishore Nallan triggering a compaction might have some other side effects as well
k
Using `NewCompactOnDeletionCollectorFactory` as described here seems to be the most generic way to account for this: https://github.com/facebook/rocksdb/blob/7fd68b7c39c76b5741533ee2a4d7024f3b513f2d/include/rocksdb/utilities/table_properties_collectors.h#L85
o
Yup
When are you planning on releasing the fix?
k
I'll be testing this out this week and will have a patch available.
👍 2
Setting `NewCompactOnDeletionCollectorFactory` does not help. New writes post deletion still take as long: I don't even see an improvement. It's not clear to me when this compaction will kick in. While I do see compaction-related logs while the deletion happens, it's not clear if that's related to this, and in any case it doesn't help. So far the only thing that helps is an outright full DB compaction right after the deletion. Even compacting the range of the deleted collection keys doesn't help.
o
I am actually surprised that DeleteRange did not help: https://github.com/facebook/rocksdb/wiki/DeleteRange
I am also surprised by this performance issue. Deleting ranges is a common operation that every system using RocksDB has to deal with.
k
Yes, that's the first thing I tried: a) modified deletes to use DeleteRange, b) tried compacting only the deleted range, c) now tried this compaction factory. Nothing helped.
This seems to be a known issue, there have been a few posts.
One suggestion I'm yet to try is `read_options.iterate_upper_bound`, which has been recommended. I will check that.
o
Both of those two issues suggest range delete.
What I mean is that, considering how common range deletes are, there are not many issues reported.
k
At least one other person has mentioned that it didn't help: https://github.com/facebook/rocksdb/issues/5265#issuecomment-488088104
k
Yes, that's the factory I mentioned trying earlier. You have to set it as a table option. This is what I did:
Copy code
options.table_properties_collector_factories.emplace_back(
                rocksdb::NewCompactOnDeletionCollectorFactory(10000, 7500, 0.5));
Maybe those values are too aggressive and never triggered the compaction?
I will try with smaller values as well.
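For readers following along, a hedged annotation of those three arguments, with the parameter roles taken from the table_properties_collectors.h header linked earlier (the threshold values are just the ones from the message above, not recommendations):

```cpp
#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

// Sketch: register the deletion-triggered compaction collector on the DB options.
rocksdb::Options options_with_deletion_collector() {
    rocksdb::Options options;
    options.table_properties_collector_factories.emplace_back(
        rocksdb::NewCompactOnDeletionCollectorFactory(
            /*sliding_window_size=*/10000, // consecutive entries examined per window
            /*deletion_trigger=*/7500,     // deletes inside one window that mark the file for compaction
            /*deletion_ratio=*/0.5));      // or: >= 50% of a file's entries are deletes (0 disables this check)
    return options;
}
```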
o
Yeah could be
Try tweaking the deletion_trigger
k
Finally, I got compactions running by setting `options.level0_file_num_compaction_trigger = 1` (default: 4). Level 0 compaction did not trigger previously because there were not enough level-0 files to reach that default value. I think if there were sustained writes after the deletion, it would have eventually happened. I still need to understand the implications of setting this value too low, but the impact of a large number of level-0 files is explained succinctly in this paragraph:
RocksDB level 0 SST files are different from other levels. Memtables are flushed to disk as level 0 SST files. The data in a level 0 SST file are arranged in the order of generation as the memtable, and data from the same regions can be spread across different files. Also, data with the same key can be in multiple SST files. Therefore, when a read occurs, each level 0 SST file must be read in order. The more level 0 SST files, the more likely query read performance will be impacted negatively.
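A minimal sketch of where that knob lives; the value 1 mirrors the experiment above, and whether it is a sane long-term default is exactly the open question:

```cpp
#include <rocksdb/options.h>

// Sketch: schedule an L0 -> L1 compaction as soon as a single level-0 SST
// file exists (the RocksDB default is 4 files).
rocksdb::Options low_l0_trigger_options() {
    rocksdb::Options options;
    options.level0_file_num_compaction_trigger = 1;
    return options;
}
```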
o
That's most likely too aggressive though. As soon as the write buffer is flushed to disk, it will trigger a compaction.
That note applies when you have many L0 files
k
Agree. This is how the files and levels looked when compaction is not triggered:
Copy code
20M 000906.sst
21M 000909.sst
10M 001039.sst
9.9M 001040.sst
4.0K 001105.sst

L0: 0
L1: 1 (2 KB)
L2: 4 (60 MB)
o
Ok there is no L0 file?
And these are not that many files
k
Yes, I got these from the `LOG` file, which matched the file sizes on disk.
o
Hmm
And how does it look after triggering compaction?
k
After compaction:
Copy code
4.0K	001136.sst
 16K	001138.sst
Nothing was logged in the LOG file on this run. I'll probably have to use some RocksDB CLI tool or something to find out what levels these are on.
o
Oh wow. Everything got compacted.
k
Yes, Level 1 triggers Level 2 and so on. The `max_bytes_for_level_base` default is 256M, so I am not able to understand how anything would ever end up in L2, because the total size is ~60M even without compaction.
Probably needs someone who is well versed in RocksDB internals to comment on this, and maybe having a stand-alone reproducible example will help elucidate these observations.
o
There is some useful info here about how to get stats https://github.com/EighteenZi/rocksdb_wiki/blob/master/RocksDB-Tuning-Guide.md
👍 1
Let me look at how rocksdb is configured here
I think the write buffer size of 4MB is too small and will result in a large number of L0 files when there is enough write throughput.
k
Yes, we certainly need to expose these RocksDB configurations for tweaking. I have another idea which I am now exploring. It's the `.next()` call which is taking time, and not the initial iterator `seek` call. This is because the next call is not bounded, and according to another comment on the RocksDB GitHub, setting an upper bound on the iterator will help. The deleted key range lies after the seek's prefix, so without an upper bound the iterator keeps wading through all those tombstones that could otherwise be skipped (there's an open but abandoned PR to handle this scenario).
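A hedged sketch of that idea, which is what the eventual patch is built around: bound the batch indexer's iteration so Next() stops at the end of the live prefix instead of wading into the freshly deleted keyspace. The key naming is illustrative, not Typesense's actual layout; note the Slice lifetime gotcha that comes up again further down:

```cpp
#include <rocksdb/db.h>
#include <memory>
#include <string>

// Sketch: iterate only the keys under `prefix`, with an explicit upper bound
// so the iterator never has to skip over tombstones that lie past it.
void iterate_prefix(rocksdb::DB* db, const std::string& prefix) {
    // The bound string must outlive the iterator: rocksdb::Slice only
    // references the memory (it behaves like a string_view).
    std::string upper_bound_str = prefix + "\xff";
    rocksdb::Slice upper_bound(upper_bound_str);

    rocksdb::ReadOptions read_opts;
    read_opts.iterate_upper_bound = &upper_bound;

    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_opts));
    for (it->Seek(prefix); it->Valid(); it->Next()) {
        // process it->key() / it->value(); Valid() turns false at the bound
    }
}
```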
o
Hmm
That would be worth exploring
Another option that would be good to be enabled by default: level_compaction_dynamic_level_bytes=true
And probably change the write buffer size to 16M
k
The docs warn about enabling `level_compaction_dynamic_level_bytes` on existing DBs, so we need to plan how to do that.
(I mean not for this bug, but as a general default)
o
Yeah
Also not sure if bloom_filter is enabled
That's also worth enabling if not already enabled
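Pulling together the knobs suggested in the last few messages, a hedged sketch of what that tuning could look like (values are the ones mentioned above, not tested recommendations, and the dynamic-level-bytes caveat for existing DBs still applies):

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch of the suggested defaults; verify against the RocksDB version in use.
rocksdb::Options tuned_options() {
    rocksdb::Options options;
    options.write_buffer_size = 16 * 1024 * 1024;        // 16 MB memtable instead of 4 MB
    options.level_compaction_dynamic_level_bytes = true; // dynamic level sizing

    rocksdb::BlockBasedTableOptions table_options;
    table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10)); // bloom filter for point lookups
    options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
    return options;
}
```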
k
What I'm going to do is extract this issue as a standalone cpp example so I can iterate faster on these tweaks. Will keep you posted.
👍 1
o
Cool
k
I was able to reproduce this issue in a standalone cpp program. Adding an iterator upper_bound helps speed things up (since iteration can end even before the tombstones are encountered), but I ran into another bug/issue with the upper bound API. I will have to post on the RocksDB mailing list / SO later.
o
Cool, that's great progress
p
Do you mind sharing the standalone program?
k
Here's the gist: https://gist.github.com/kishorenc/4ff763062f694665514d9d8ddcabae52 I've managed to fix the upper bound API issue I ran into earlier: it's because Slice is like a string_view and I was using a stack-allocated local. Apart from that, in the code above, you will notice that compaction does not kick in despite using that factory.
I'm now incorporating the change of setting an upper bound on the batch indexer iterator, which fixes the Next() stall caused by the deleted tombstones. However, compaction not kicking in is still an issue, which I need to raise with the RocksDB folks. Maybe my understanding of how that factory option works is not correct.
I've published the `typesense/typesense:0.24.0.rc28` Docker image, which contains the iterator upper bound patch that has eliminated the post-bulk-delete write latencies for me locally.
🎉 3
🙌 3
The RocksDB folks also clarified about `NewCompactOnDeletionCollectorFactory`: it only works for single deletes and not for range deletes. If range delete is used, then the factory is redundant. I've already switched the collection deletion to range delete as well, so I think we should be good now.
👍 1
o
Great.
p
The fix works with our original use case, as well as with my tests.
🙌 1
k
Thank you for all the help in identifying and fixing this!
o
Thank you for prioritizing the issue and fixing it