#community-help

Production Cluster Failure and Solution

TLDR Andrew's production cluster unexpectedly became unhealthy. Kishore Nallan traced the failure to a null value in a string[] field, which v0.19 did not catch because it validates only the first array entry. Jason upgraded the cluster to v0.20, which resolved the issue.


Apr 21, 2021
Andrew
03:09 PM
At 9:17 AM CDT my production cluster bue2jx8qic7kstwrp unexpectedly stopped working and was listed as ‘unhealthy’ by Typesense
Kishore Nallan
03:20 PM
Looking into it Andrew
Andrew
03:20 PM
Like I just had to spin up a new cluster and reindex everything
Andrew
03:20 PM
Suuuper scary, glad my customers are mostly on the west coast
Kishore Nallan
03:21 PM
Sorry about that. I will share the findings with you shortly. We have to improve the cloud UX to expose logs and other resolution actions from the UI.
Kishore Nallan
03:47 PM
The issue happened because there was a null value in a field defined as a string[].

In v0.19, we validate only the first entry in an array for the type. This has since been fixed on master and I will be happy to migrate your cluster to a stable 0.20 RC build if you like. Please remove null values from arrays for now as a workaround.
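A minimal sketch of the workaround above, assuming documents are plain Python dicts prepared client-side before indexing. The helper name `strip_nulls_from_arrays` and the sample `tags` field are hypothetical and not part of Typesense or its client libraries.

```python
from typing import Any, Dict


def strip_nulls_from_arrays(document: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `document` with None entries removed from list-valued fields."""
    cleaned: Dict[str, Any] = {}
    for field, value in document.items():
        if isinstance(value, list):
            # Drop nulls but keep every other element in its original order.
            cleaned[field] = [item for item in value if item is not None]
        else:
            cleaned[field] = value
    return cleaned


# Hypothetical document with a null in the second position of an array field.
doc = {"id": "1", "tags": ["red", None, "blue"]}
print(strip_nulls_from_arrays(doc))  # {'id': '1', 'tags': ['red', 'blue']}
```

Running every document through a helper like this before sending it to the cluster keeps nulls out of array fields regardless of the server version.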
Kishore Nallan
03:48 PM
It was the second value in this case Andrew and that caused an issue.
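To illustrate why a null in the second position slips through, here is a small model of the two checks. It is only an illustration of the behavior described above, not Typesense's actual validation code, and both function names are made up for this sketch.

```python
from typing import Any, List


def first_entry_is_string(values: List[Any]) -> bool:
    """Model of a first-entry-only type check, as described for v0.19."""
    return len(values) == 0 or isinstance(values[0], str)


def all_entries_are_strings(values: List[Any]) -> bool:
    """Model of a check that inspects every element of the array."""
    return all(isinstance(value, str) for value in values)


# Hypothetical string[] value with a null as its second entry.
tags = ["red", None, "blue"]
print(first_entry_is_string(tags))    # True: the null is never inspected
print(all_entries_are_strings(tags))  # False: the null is caught
```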
Andrew
04:04 PM
if you migrate my cluster to 0.20, I’ll need to update my client library, yeah?
Andrew
04:04 PM
This is really stressful, stuff is broken
Andrew
04:04 PM
again
Jason
04:06 PM
Andrew No, you can use previous versions of the library with v0.20

Andrew
04:06 PM
Kishore Nallan Jason like I have to maintain 99.9% uptime, and the clock is ticking. I don’t want to have to spin up another cluster, it sounds like that won’t actually do any good?

If I get this right, someone can upgrade my existing cluster dez7hbjn35u89a1mp to a higher version, and the issue will go away?
Andrew
04:06 PM
Sweet, so you upgrade the cluster, and everything will just work?
Jason
04:06 PM
Yup, I can upgrade dez7hbjn35u89a1mp.
Andrew
04:07 PM
❤️ plz and thank you
Jason
04:07 PM
Oh wait, you'd need to reindex the data once I upgrade to v0.20. To get the bad data out of the logs
Andrew
04:07 PM
that’s fine
Andrew
04:07 PM
only like 2k records
Jason
04:09 PM
Andrew Ok to wipe the existing data from the cluster, yeah?
Andrew
04:09 PM
yup
Jason
04:09 PM
You'd need to generate a new API key btw
Andrew
04:10 PM
noooooooo
Andrew
04:10 PM
that’s a deploy which will take like 20 minutes
Kishore Nallan
04:10 PM
We can just drop the collections if there aren’t many.
Kishore Nallan
04:10 PM
Jason you don’t need to delete the data
Kishore Nallan
04:10 PM
Jason, just upgrade; v0.20 should handle the bad data.
Jason
04:11 PM
Oh cool, ok
Jason
04:11 PM
Ok, cluster is running v0.20
Andrew
04:12 PM
🎺
Andrew
04:12 PM
ight data is showing up on my customers dashboard

Andrew
04:12 PM
thanks guys
Kishore Nallan
04:13 PM
I apologize once again for this: we are working on automatic recovery from bad data so that bad records can be skipped over.
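The idea of skipping over bad records can be sketched on the client side as follows. This is a generic illustration under stated assumptions, not the server-side recovery Typesense was working on: `index_skipping_bad` and `strict_index` are hypothetical names, and `strict_index` stands in for whatever call actually indexes a single record.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def index_skipping_bad(
    records: Iterable[Record],
    index_one: Callable[[Record], None],
) -> Tuple[int, List[Tuple[Record, str]]]:
    """Index each record; collect failures instead of aborting the whole batch."""
    indexed = 0
    skipped: List[Tuple[Record, str]] = []
    for record in records:
        try:
            index_one(record)
            indexed += 1
        except (TypeError, ValueError) as exc:
            # Remember the bad record and the reason, then keep going.
            skipped.append((record, str(exc)))
    return indexed, skipped


def strict_index(record: Record) -> None:
    """Hypothetical stand-in for a real indexing call; rejects non-string tags."""
    if not all(isinstance(tag, str) for tag in record.get("tags", [])):
        raise ValueError("tags must contain only strings")


batch = [{"id": "1", "tags": ["a"]}, {"id": "2", "tags": ["b", None]}]
indexed, skipped = index_skipping_bad(batch, strict_index)
print(indexed, [record["id"] for record, _ in skipped])  # 1 ['2']
```

Returning the skipped records with their error messages makes it possible to fix and retry them later without blocking the healthy part of the batch.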
