#random

Troubleshooting Unhealthy Cluster Issue

TLDR Sruli was unable to utilize their cluster. Jason suggested an update, which didn't solve the issue, then diagnosed the problem as a large string causing crashes. The resolution required resetting the cluster state.

Jan 27, 2023 (11 months ago)
Sruli
12:54 AM
What does an unhealthy cluster mean? It’s not letting me search or add to the db
Jason
12:58 AM
It usually means that the node is out of RAM / CPU. But I just took a look at your cluster, and it looks like some data that was sent to it is causing it to crash due to a bug.
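Note: a minimal sketch of how a node's health could be checked from the client side, assuming the cluster exposes Typesense's `GET /health` endpoint and that it returns a JSON body with an `ok` flag. The `resource_error` field is an assumption about what some versions report when a node runs out of resources; the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def describe_health(payload: dict) -> str:
    """Turn a parsed /health response body into a short status string."""
    if payload.get("ok") is True:
        return "healthy"
    # Assumption: some versions add a `resource_error` field
    # (for example when the node is out of memory or disk).
    reason = payload.get("resource_error")
    if reason:
        return f"unhealthy ({reason})"
    return "unhealthy (no reason reported)"


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/health and parse the JSON body."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        # A non-2xx status may still carry a JSON body worth inspecting.
        body = err.read()
    return json.loads(body.decode("utf-8"))
```

Calling `describe_health(fetch_health(base_url))` with your own cluster's base URL would print-friendly summarize the node state. A crash loop like the one in this thread may not report a resource error at all, so an "unhealthy" result without a reason is a cue to contact support.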
Jason
12:58 AM
Let’s try upgrading you to the latest RC build to see if that fixes the issue.
Sruli
12:59 AM
Is that something I will have to do?
Jason
12:59 AM
You can schedule it from the dashboard, but since we’re already chatting, I’ve gone ahead and queued up the upgrade for you
Jason
12:59 AM
Should be done in about 5 mins
Sruli
12:59 AM
Thank you 🙏

Sruli
01:06 AM
Looks like it’s still not working
Jason
01:08 AM
Yeah looks like the issue persists in the latest RC build. So unfortunately the only way we can recover this cluster is by resetting cluster state (which will delete all data and API keys), and have you reindex your data.

I’m curious to see whether it still manages to get into this state if you index it fresh in 0.24.0.rc
Jason
01:08 AM
Would you be ok with us resetting cluster state?
Sruli
01:08 AM
Is it possible to downgrade it to a previous version?
Jason
01:09 AM
The previous version also has this issue
Sruli
01:09 AM
Because the error wasn’t happening in my last cluster, with the exact same data..
Jason
01:09 AM
Hmmm
Sruli
01:09 AM
Could be a new issue introduced in one of the newer versions
Jason
01:09 AM
Your previous cluster was running 0.23.1, and this new cluster was also running 0.23.1
Sruli
01:10 AM
Weird
Jason
01:10 AM
Just to be sure, the data is static and there were zero changes in the dataset between your old cluster and new one?
Sruli
01:10 AM
Not exactly
Sruli
01:11 AM
But very similar
Jason
01:13 AM
Looks like the dataset now has a huge string (in Hebrew?) in a field that has sorting enabled… and that’s causing the crash.
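Note: before reindexing, a dataset could be scanned for this condition. The sketch below assumes the collection schema is a dict with a `fields` list whose entries carry `name`, `type`, and an optional `sort` flag, as in Typesense collection schemas. The function name and the `max_chars` threshold are hypothetical; the thread does not say how long a string has to be to trigger the crash.

```python
def find_long_sort_values(schema, documents, max_chars=1000):
    """Return (doc_index, field_name, length) for long strings in sort-enabled fields."""
    # Only string fields that explicitly enable sorting are of interest here.
    sort_fields = sorted(
        field["name"]
        for field in schema.get("fields", [])
        if field.get("type") == "string" and field.get("sort") is True
    )
    hits = []
    for index, doc in enumerate(documents):
        for name in sort_fields:
            value = doc.get(name)
            # max_chars is an illustrative cutoff, not a documented limit.
            if isinstance(value, str) and len(value) > max_chars:
                hits.append((index, name, len(value)))
    return hits
```

Any document it flags is a candidate for trimming, or a sign that sorting should be turned off on that field.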
Sruli
01:13 AM
Oh, that makes sense
Sruli
01:13 AM
I couldn’t remember the exact collections settings
Sruli
01:13 AM
Can you disable that from your end or will I need to reset the whole thing?
Jason
01:15 AM
Unfortunately, we’d need to reset the whole thing
Sruli
01:16 AM
Not a problem, can you pull out the current collection settings or also no?
Jason
01:23 AM
We can recover the collection settings. DMing it to you
Jason
01:24 AM
Done
Jason
01:24 AM
You essentially want to turn off sorting on the long string field.
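Note: a minimal sketch of what turning off sorting could look like when recreating the collection from the recovered settings. It assumes the schema is a dict in the Typesense collection format, where each field may carry a `sort` flag. The helper name and the field names in the example are hypothetical, since the actual collection settings were shared privately.

```python
import copy


def disable_sort(schema, field_names):
    """Return a copy of the schema with `sort` set to False on the named fields."""
    updated = copy.deepcopy(schema)  # leave the recovered schema untouched
    for field in updated.get("fields", []):
        if field.get("name") in field_names:
            field["sort"] = False
    return updated


# Hypothetical recovered schema: `description` holds the long string.
recovered = {
    "name": "articles",
    "fields": [
        {"name": "title", "type": "string", "sort": True},
        {"name": "description", "type": "string", "sort": True},
    ],
}
fixed = disable_sort(recovered, {"description"})
```

The resulting dict could then be used as the body when creating the collection again after the reset, before reindexing the data.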
Jason
01:24 AM
We’re working on adding a guard for this in the next RC build
Sruli
01:24 AM
Cool!
Jason
01:24 AM
In the meantime, ok to reset cluster state?
Sruli
01:24 AM
Yup
Jason
01:26 AM
Ok, you’re all set. Cluster should be healthy again.
Jason
01:26 AM
Sorry about the issue, and thank you for helping catch this
Sruli
01:27 AM
Thank you!!

Sruli
01:27 AM
Really appreciate the quick responses
