#community-help

Investigating Unhealthy Cluster and Typesense Issues

TLDR Gustavo reported an unhealthy cluster, which Jason traced to writes stalling on slow responses from OpenAI's embedding API. Downgrading to RC35 restored searches and writes, but a permanent fix is still pending.

Jun 26, 2023
Gustavo
04:44 PM
My cluster (v601y2x3upjea4tip) is Unhealthy. Can someone check it?
Jason
04:45 PM
Yup, already looking into it. It looks like there’s a write that’s stalling all other API calls
Jason
04:45 PM
Working on addressing it
Gustavo
04:48 PM
It's working now
Gustavo
04:50 PM
Hmm, actually not 😄
Jason
04:50 PM
The node is still recovering… Should take another 5-10 minutes
Jason
04:51 PM
It’s processing the backlog of writes right now
Jason
04:52 PM
Are you using OpenAI’s embedding model?
Gustavo
04:53 PM
Yes
Jason
04:53 PM
Hmm, looks like their API might be taking a long time to respond, which is what’s stalling the writes
Jason
04:54 PM
We should probably set a shorter timeout value to error out in cases like this
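The idea Jason describes, a shorter timeout so that a slow embedding call errors out instead of stalling every write behind it, can be sketched roughly as below. This is only an illustration in Python, not Typesense's actual implementation (the thread does not show it); `call_with_timeout`, `process_writes`, `fake_embed`, and the 0.2-second deadline are all hypothetical.

```python
import concurrent.futures
import time


class EmbeddingTimeout(Exception):
    """Raised when the (hypothetical) embedding call misses its deadline."""


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) in a worker thread and give up after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise EmbeddingTimeout(f"no response within {timeout_s}s") from None
    finally:
        # Do not wait for a stuck worker; it finishes in the background.
        executor.shutdown(wait=False)


def process_writes(texts, embed, timeout_s):
    """Embed each text; a timed-out item is recorded as failed and skipped."""
    done, failed = [], []
    for text in texts:
        try:
            done.append((text, call_with_timeout(embed, timeout_s, text)))
        except EmbeddingTimeout:
            failed.append(text)
    return done, failed


def fake_embed(text):
    """Stand-in for a remote embedding API; the input "slow" simulates a stall."""
    if text == "slow":
        time.sleep(1.0)
    return [float(len(text))]


# The stalled item fails fast, and the writes queued behind it still go through.
done, failed = process_writes(["a", "slow", "bb"], fake_embed, timeout_s=0.2)
```

With a deadline in place, one unresponsive request costs at most `timeout_s` seconds rather than blocking the whole queue.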
Gustavo
05:08 PM
Why does it spike like that? There shouldn't be any special traffic. Only thing that's different is Typesense's instability.
[Image attached]
Jason
05:09 PM
That’s the reindexing happening on each process restart
Jason
05:09 PM
We’ve been trying to restart the Typesense process to get the writes to be retried
Gustavo
05:09 PM
Got it
Jason
05:10 PM
Going to try to skip the writes in the queue to stabilize the cluster
Jason
05:20 PM
Ok the cluster is back up now…
Jason
05:21 PM
Across two restarts the write queue got stuck around the same position… So I suspect some input might be stalling the OpenAI API somehow, or maybe there’s a lurking bug somewhere on the Typesense side. Will take a closer look and keep you posted.
Gustavo
05:23 PM
Alright, thank you
Gustavo
05:51 PM
I had some timeouts around 13:31 trying to write some documents. Do you think they were probably due to this issue?
Gustavo
05:54 PM
Sorry, forgot to mention it's Brazilian Time (UTC-3)
Gustavo
05:55 PM
So it was around 16:31 UTC
Jason
05:56 PM
This particular issue started around 10 minutes before your first message in this thread
Gustavo
05:57 PM
Yeah, so it was probably related to it
Gustavo
06:08 PM
The cluster is unhealthy again, are you aware?
Jason
06:10 PM
Hmm, same issue again 😞
Jason
06:29 PM
Back up
Gustavo
06:30 PM
Searching works, writing gives me: ObjectUnprocessable: Request failed with HTTP code 422 | Server said: Skipping writes.
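While the server rejects writes with a 422 like the one above, one client-side option is to hold rejected documents and replay them once writes are accepted again. The sketch below is hypothetical and is not something either participant did in this thread: `WritesSkipped` is a local stand-in for the client's 422 error, and `write_fn` is whatever function sends a single document to the cluster.

```python
class WritesSkipped(Exception):
    """Local stand-in for the client's HTTP 422 'Skipping writes' error."""


class BufferedWriter:
    """Holds documents the server rejected so they can be replayed in order."""

    def __init__(self, write_fn):
        self.write_fn = write_fn  # hypothetical: sends one document to the cluster
        self.pending = []

    def write(self, doc):
        """Send doc now, or buffer it. Returns True if it was sent immediately."""
        if self.pending:
            # Keep ordering: never jump ahead of documents already waiting.
            self.pending.append(doc)
            return False
        try:
            self.write_fn(doc)
            return True
        except WritesSkipped:
            self.pending.append(doc)
            return False

    def flush(self):
        """Replay buffered documents in order; stop at the first rejection."""
        sent = 0
        while self.pending:
            try:
                self.write_fn(self.pending[0])
            except WritesSkipped:
                break
            self.pending.pop(0)
            sent += 1
        return sent
```

Calling `flush()` periodically (for example from a scheduled job) drains the buffer once the cluster accepts writes again. The buffer here lives in memory only, so a real deployment would likely need to persist it.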
Jason
06:31 PM
I think re-enabling writes will cause the cluster to go into the same state the next time there’s an issue with OpenAI APIs until we fix the timeouts. ETA for that fix is probably 24 hours or so
Gustavo
06:34 PM
24h without writes will cause a bit of a mess here 😟, isn't there some other quicker workaround?
Jason
06:37 PM
I re-enabled writes, and the cluster is already back to the same state. We can try reverting to an earlier build to see if this issue just started now
Gustavo
06:38 PM
I'm fine if you revert to RC35, I'm not depending on the fixes from RC39 yet
Jason
06:39 PM
Ok queued the downgrade up, should start running in about a minute
Gustavo
06:46 PM
Searches and writes working with RC35
Jason
06:47 PM
Will keep you posted on the fix
Gustavo
06:48 PM
Great, thanks
Jason
10:08 PM
Gustavo, would it be OK to take a snapshot of your dataset and load it in an internal debug Typesense instance to narrow down this issue?
Gustavo
10:57 PM
Are you able to take a snapshot without including the users collection?
Jason
10:57 PM
No unfortunately, the snapshot includes all data in the cluster
Gustavo
11:00 PM
Not sure in that case 😕
Gustavo
11:00 PM
Is there another way I can help you?
Gustavo
11:02 PM
If you can create a cluster and add me as a member, I can save the data from the posts collection in it, which is the collection that uses OpenAI
Jason
11:02 PM
Ok that works, let me do that and ping you
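Copying only the posts collection into a separate cluster, as agreed above, might look roughly like the sketch below. It assumes client objects shaped like the official Typesense Python client, where `documents.export()` returns JSONL text and `documents.import_()` accepts JSONL text and returns one JSON result per line; treat those shapes as assumptions to check against the client docs. `copy_collection` and the `Fake*` classes are hypothetical stand-ins so the example runs without a cluster, and the target collection is assumed to already exist with a matching schema.

```python
import json


def copy_collection(source, target, name):
    """Export one collection as JSONL and import it into another cluster.

    Returns (imported_count, error_count). Other collections are untouched.
    """
    jsonl = source.collections[name].documents.export()
    if not jsonl.strip():
        return 0, 0
    response = target.collections[name].documents.import_(jsonl, {"action": "upsert"})
    results = [json.loads(line) for line in response.splitlines() if line.strip()]
    ok = sum(1 for r in results if r.get("success"))
    return ok, len(results) - ok


class FakeDocuments:
    """In-memory stand-in for a collection's documents endpoint."""

    def __init__(self):
        self.rows = []

    def export(self):
        return "\n".join(json.dumps(row) for row in self.rows)

    def import_(self, jsonl, params=None):
        out = []
        for line in jsonl.splitlines():
            self.rows.append(json.loads(line))
            out.append(json.dumps({"success": True}))
        return "\n".join(out)


class FakeCollection:
    def __init__(self):
        self.documents = FakeDocuments()


class FakeClient:
    def __init__(self):
        self.collections = {"posts": FakeCollection(), "users": FakeCollection()}


source, target = FakeClient(), FakeClient()
source.collections["posts"].documents.rows = [
    {"id": "1", "title": "hello"},
    {"id": "2", "title": "world"},
]
source.collections["users"].documents.rows = [{"id": "u1"}]

# Only "posts" is copied; the users collection never leaves the source cluster.
copied, errors = copy_collection(source, target, "posts")
```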