# community-help
r
Has anyone experienced stability issues with their self-hosted Typesense server? We have fewer than 10,000 records in Typesense, deployed as an AWS ECS task. On average it uses about 100 MB of memory while running, with a 400 MB limit. Usually once, maybe twice per day, the server will just freeze for one minute. Our clients receive a connection timeout trying to reach the server. The logs are not helpful, but you can see where the normally consistent log cadence (every 10 seconds) becomes irregular or goes silent for periods of time (8/6 14:36 GMT). Running v0.24.1.
I don’t believe it’s memory/CPU related. It stays very consistent at around 21% / 100 MB.
[attachment: CPU/memory metrics chart over 24 hours]
message has been deleted
j
Doesn’t ECS spin containers up and down on-demand? Could this be related to cold-start times?
r
Nope. It’s been running for 14 days.
j
Ah hmm, I’m confusing ECS with Fargate
Do you see any logs in that timeframe? Usually Typesense will log the current sequence number every 3-5s
Or do the logs also freeze during this time?
r
See the xlsx spreadsheet above.
I can paste a CSV if needed.
j
I do see logs during this period
So strange… I’ve never heard of such an issue or seen it on Typesense Cloud either. Is there another HTTP proxy in front of Typesense? Like an API gateway?
r
Look at the consistency of the prior logs: 00, 10, 20, 30, 40, 50, 00, …
Typesense is running internal-only; it’s not exposed externally. Our API uses the Python Typesense client to connect to it.
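For context, the connection setup looks roughly like this (the host, API key, collection, field, and timeout values below are placeholders, not our real config):
```python
import typesense

# Rough sketch of our client setup; all values here are placeholders.
client = typesense.Client({
    "nodes": [{
        "host": "typesense.internal",  # internal service hostname (placeholder)
        "port": "8108",
        "protocol": "http",
    }],
    "api_key": "xyz",                 # placeholder key
    "connection_timeout_seconds": 2,  # this is the timeout our clients hit when the server freezes
})

# Typical search call made by our API (collection/field names are illustrative)
results = client.collections["products"].documents.search({
    "q": "example",
    "query_by": "name",
})
```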
message has been deleted
j
Do you have at least 2 vCPUs allocated to the container?
r
t4g.medium
no vCPU restrictions
j
One way to narrow this down would be to try restarting the Typesense process to see if that fixes the issue… If it does, then it’s more likely a Typesense issue than not. If not, it could be coming from elsewhere in the stack
Another option is to upgrade to 0.25.0.rc61; we’ve fixed some deadlock issues related to writes… (though that doesn’t seem to be the case here)
r
We’ve not been able to reproduce it ourselves, even with a load test script that hits it thousands of times with random queries.
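The load test was roughly along these lines; the collection, query field, and word list here are illustrative stand-ins, not our real schema:
```python
import random
import typesense

# Illustrative load-test sketch: hammer the search endpoint with random queries.
# Host, key, collection, field, and word list are placeholders.
client = typesense.Client({
    "nodes": [{"host": "typesense.internal", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 2,
})

words = ["widget", "gadget", "blue", "large", "2023"]

for i in range(10_000):
    q = " ".join(random.sample(words, k=2))
    try:
        client.collections["products"].documents.search({"q": q, "query_by": "name"})
    except Exception as exc:  # surface any timeout/connection error with its iteration
        print(f"request {i} failed: {exc}")
```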
We only write to it a few times per week. I hadn’t thought about correlating the freezes with writes, but I can check.
I’m using this modified container image in an attempt to have ECS restart the container. I don’t think ECS has had to perform a docker restart though. https://github.com/typesense/typesense/pull/1097/files
j
Probably a side note: the /health endpoint will return unhealthy when, e.g., the read/write lag threshold is breached. And you don’t want to restart Typesense at that point, since it’s still working through the backlog.
A TCP healthcheck that just ensures there’s a process listening on the Typesense port is what you’d want to use in this particular context.
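Something along these lines is all that check needs to do (a rough sketch; the port and timeout are just examples):
```python
import socket
import sys

# Minimal TCP healthcheck sketch: succeed if something is listening on the
# Typesense port, regardless of what /health would report. Port/timeout are examples.
HOST, PORT, TIMEOUT = "127.0.0.1", 8108, 3

try:
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
        sys.exit(0)  # healthy: the process is accepting TCP connections
except OSError:
    sys.exit(1)      # unhealthy: connection refused or timed out
```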
Anyway, I’m not too sure what else could be happening here…
Could you maybe try removing this healthcheck and see if that helps?
r
The healthcheck is just checking for an HTTP 2xx. Are you saying that this backlog state returns something other than an HTTP 200 status code?
j
Correct, it will return a 503
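So if you keep an HTTP-based check, it should treat a 503 as "alive but catching up" rather than "restart me". A rough sketch (the URL is an example, and the exact response body isn’t guaranteed here):
```python
import requests

# Sketch: probe /health and treat a 503 as "alive but lagging", not as a reason to restart.
try:
    resp = requests.get("http://127.0.0.1:8108/health", timeout=3)
    if resp.status_code == 200:
        print("healthy")
    elif resp.status_code == 503:
        print("alive, but read/write lag threshold breached: do not restart")
    else:
        print(f"unexpected status: {resp.status_code}")
except requests.RequestException as exc:
    print(f"unreachable: {exc}")
```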
r
I can change it to a TCP healthcheck as you suggested. I’m not sure how I feel about the 503 on your end, though…
@Jason Bosco Can you take a look at these logs and help me understand what’s going on under Typesense’s hood? It was down for 3 minutes last night, correlating with specific log entries.
message has been deleted
j
I’ve usually seen this happen when CPU is fully exhausted, or in the case of AWS, when burst capacity runs out.
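Since a t4g.medium is a burstable instance, one way to rule that out is to pull its CPUCreditBalance from CloudWatch around the freeze window. A rough sketch (the instance ID is a placeholder):
```python
from datetime import datetime, timedelta

import boto3

# Sketch: fetch CPUCreditBalance for the burstable (t4g) instance over the last 24h.
# A balance near zero around the freeze would point at burst-capacity exhaustion.
cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Minimum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```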
r
Any other ideas? CPU was < 5%
j
Not that I can think of… I wonder if there’s something unique about the ECS environment that could cause this. Could you try running on EC2 machines to see if you can replicate the issue there?