My cluster (`v601y2x3upjea4tip`) is Unhealthy. Can...
# community-help
g
My cluster (`v601y2x3upjea4tip`) is Unhealthy. Can someone check it?
j
Yup, already looking into it. It looks like there’s a write that’s stalling all other API calls
Working on addressing it
👍 1
g
It's working now
Hmm, actually not 😄
j
The node is still recovering… Should take another 5-10 minutes
It’s processing the backlog of writes right now
👍 1
Are you using OpenAI’s embedding model?
g
Yes
j
Hmm, looks like their API might be taking a long time to respond, which is what’s stalling the writes
We should probably set a shorter timeout value to error out in cases like this
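(A minimal sketch of the shorter-timeout idea, for illustration only. This is not Typesense's actual fix, which isn't shown in this thread. `embed_with_timeout`, `EmbeddingTimeout`, and `slow_embed` are hypothetical names, and `slow_embed` just stands in for a remote embedding call that is slow to respond.)

```python
import concurrent.futures
import time


class EmbeddingTimeout(Exception):
    """Raised when the embedding call does not finish within the deadline."""


def embed_with_timeout(embed_fn, text, timeout_s):
    """Run embed_fn(text), but give up after timeout_s seconds.

    The caller gets an error quickly instead of blocking behind a slow
    remote API. Note: the worker thread is not killed; it keeps running
    in the background until embed_fn returns on its own.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(embed_fn, text)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise EmbeddingTimeout(
            f"embedding call exceeded {timeout_s}s"
        ) from None
    finally:
        # Do not wait for a stuck call; release the caller immediately.
        executor.shutdown(wait=False)


def slow_embed(text):
    """Hypothetical stand-in for a remote embedding API that responds slowly."""
    time.sleep(0.3)
    return [0.0, 0.0]
```

With something like this, a single slow write would fail with `EmbeddingTimeout` and could be reported back to the client, instead of holding up every write queued behind it.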
g
Why does it spike like that? There shouldn't be any special traffic. Only thing that's different is Typesense's instability.
j
That’s the reindexing happening on each process restart
We’ve been trying to restart the Typesense process to get the writes to be retried
g
Got it
j
Going to try to skip the writes in the queue to stabilize the cluster
👍 1
Ok the cluster is back up now…
Across two restarts the write queue got stuck around the same position… So I suspect some input might be stalling the OpenAI API somehow, or maybe there’s a lurking bug somewhere on the Typesense side. Will take a closer look and keep you posted.
g
Alright, thank you
I had some timeouts around 13:31 trying to write some documents. Do you think they were probably due to this issue?
Sorry, forgot to mention it's Brazilian Time (UTC-3)
So it was around 16:31 UTC
j
This particular issue started around 10 minutes before your first message in this thread
g
Yeah, so it was probably related to it
The cluster is unhealthy again, are you aware?
j
Hmm, same issue again 😞
Back up
g
Searching works, writing gives me:
ObjectUnprocessable: Request failed with HTTP code 422 | Server said: Skipping writes.
j
I think re-enabling writes will cause the cluster to go into the same state the next time there’s an issue with OpenAI APIs until we fix the timeouts. ETA for that fix is probably 24 hours or so
g
24h without writes will cause a bit of a mess here 😟, isn't there some other quicker workaround?
j
I re-enabled writes, and the cluster is already back to the same state. We can try reverting to an earlier build to see if this issue just started now
g
I'm fine if you revert to the RC35, I'm not depending on the fixes from RC39 yet
j
Ok queued the downgrade up, should start running in about a minute
g
Searches and writes working with RC35
🤞 1
👍 1
j
Will keep you posted on the fix
g
Great, thanks
j
@Gustavo Would it be ok to take a snapshot of your dataset and load it in an internal debug Typesense instance to narrow down this issue?
g
Are you able to take a snapshot without including the `users` collection?
j
No unfortunately, the snapshot includes all data in the cluster
g
Not sure in that case 😕
Is there another way I can help you?
If you can create a cluster and add me as a member, I can save the data from the `posts` collection in it, which is the collection that uses OpenAI
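(A rough sketch of how copying a single collection between clusters could look, using Typesense's documents export and import HTTP endpoints. The helper names, URLs, and keys are placeholders, not values from this thread, and it assumes the `posts` collection has already been created with the same schema on the target cluster.)

```python
import urllib.request


def build_export_request(base_url, api_key, collection):
    """Build a GET request that exports a collection's documents as JSONL."""
    url = f"{base_url}/collections/{collection}/documents/export"
    return urllib.request.Request(
        url, headers={"X-TYPESENSE-API-KEY": api_key}, method="GET"
    )


def build_import_request(base_url, api_key, collection, jsonl_bytes):
    """Build a POST request that imports JSONL documents into a collection."""
    url = f"{base_url}/collections/{collection}/documents/import?action=create"
    return urllib.request.Request(
        url,
        data=jsonl_bytes,
        headers={"X-TYPESENSE-API-KEY": api_key, "Content-Type": "text/plain"},
        method="POST",
    )


def copy_collection(src_url, src_key, dst_url, dst_key, collection):
    """Export one collection from the source cluster and import it into the target.

    Reads the whole export into memory, so this is only suitable for
    small collections; only the named collection is transferred.
    """
    with urllib.request.urlopen(
        build_export_request(src_url, src_key, collection)
    ) as resp:
        jsonl = resp.read()
    with urllib.request.urlopen(
        build_import_request(dst_url, dst_key, collection, jsonl)
    ) as resp:
        # The import endpoint answers with one JSON result line per document.
        return resp.read().decode("utf-8")
```

Something like `copy_collection(SRC_URL, SRC_KEY, DST_URL, DST_KEY, "posts")` would then move only `posts` and leave `users` untouched on the original cluster.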
j
Ok that works, let me do that and ping you
👍 1