# community-help
r
Has anyone experienced stability issues with their self-hosted Typesense server? We have fewer than 10,000 records in Typesense, deployed as an AWS ECS task. On average it uses about 100 MB of memory while running, with a 400 MB limit. Usually once, maybe twice per day, the server will just freeze for one minute. Our clients receive a connection timeout trying to reach the server. The logs are not helpful, but you can see where the normally consistent log cadence (every 10 seconds) becomes irregular or goes silent for periods of time (8/6 14:36 GMT). Running v0.24.1.
I don’t believe it’s memory/CPU related. It stays very consistent at around 21% / 100 MB.
[attachment: CPU/memory metrics chart over 24 hours]
message has been deleted
j
Doesn’t ECS spin containers up and down on-demand? Could this be related to cold-start times?
r
Nope. It’s been running for 14 days.
j
Ah hmm, I’m confusing ECS with Fargate
Do you see any logs in that timeframe? Usually Typesense will log the current sequence number every 3-5s
Or do the logs also freeze during this time?
r
See the xlsx spreadsheet above.
I can paste a CSV if needed.
j
I do see logs during this period
So strange… I’ve never heard of such an issue or seen it on Typesense Cloud either. Is there another HTTP proxy in front of Typesense? Like an API gateway?
r
Look at the consistency of the prior logs: 00, 10, 20, 30, 40, 50, 00, …
Typesense is running internal-only; it’s not exposed externally. Our API uses the Python Typesense client to connect to it.
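For context, the connection setup looks roughly like this (the host, API key, collection, field, and timeout values below are placeholders, not our real config):
```python
import typesense

# Rough sketch of our client setup; all values here are placeholders.
client = typesense.Client({
    "nodes": [{
        "host": "typesense.internal",  # internal service hostname (placeholder)
        "port": "8108",
        "protocol": "http",
    }],
    "api_key": "xyz",                 # placeholder key
    "connection_timeout_seconds": 2,  # this is the timeout our clients hit when the server freezes
})

# Typical search call made by our API (collection/field names are illustrative)
results = client.collections["products"].documents.search({
    "q": "example",
    "query_by": "name",
})
```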
message has been deleted
j
Do you have at least 2 vCPUs allocated to the container?
r
t4g.medium
no vCPU restrictions
j
One way to narrow this down would be to try restarting the Typesense process to see if that fixes the issue… If it does, then it’s more likely a Typesense issue than not. If not, it could be coming from elsewhere in the stack
Another option is to upgrade to 0.25.0.rc61; we’ve fixed some deadlock issues related to writes… (though that doesn’t seem to be the case here)
r
We’ve not been able to reproduce it ourselves, even with a load test script that hits it thousands of times with random queries.
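The load test was roughly along these lines; the collection, query field, and word list here are illustrative stand-ins, not our real schema:
```python
import random
import typesense

# Illustrative load-test sketch: hammer the search endpoint with random queries.
# Host, key, collection, field, and word list are placeholders.
client = typesense.Client({
    "nodes": [{"host": "typesense.internal", "port": "8108", "protocol": "http"}],
    "api_key": "xyz",
    "connection_timeout_seconds": 2,
})

words = ["widget", "gadget", "blue", "large", "2023"]

for i in range(10_000):
    q = " ".join(random.sample(words, k=2))
    try:
        client.collections["products"].documents.search({"q": q, "query_by": "name"})
    except Exception as exc:  # surface any timeout/connection error with its iteration
        print(f"request {i} failed: {exc}")
```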
We only write to it a few times per week. I hadn’t thought about correlating the freezes with writes, but I can check.
I’m using this modified container image in an attempt to have ECS restart the container. I don’t think ECS has had to perform a docker restart though. https://github.com/typesense/typesense/pull/1097/files
j
Probably a side note: the /health endpoint will return unhealthy when, e.g., the read/write lag threshold is breached. And you don’t want to restart Typesense at that point, since it’s still working through the backlog.
A TCP healthcheck that just ensures there’s a process listening on the Typesense port is what you’d want to use in this particular context.
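Something along these lines is all that check needs to do (a rough sketch; the port and timeout are just examples):
```python
import socket
import sys

# Minimal TCP healthcheck sketch: succeed if something is listening on the
# Typesense port, regardless of what /health would report. Port/timeout are examples.
HOST, PORT, TIMEOUT = "127.0.0.1", 8108, 3

try:
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
        sys.exit(0)  # healthy: the process is accepting TCP connections
except OSError:
    sys.exit(1)      # unhealthy: connection refused or timed out
```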
Anyway, I’m not too sure what else could be happening here…
Could you maybe try removing this healthcheck and see if that helps?
r
The healthcheck is just checking for an HTTP 2xx. Are you saying that this backlog state returns something other than an HTTP 200 status code?
j
Correct, it will return a 503
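So if you keep an HTTP-based check, it should treat a 503 as "alive but catching up" rather than "restart me". A rough sketch (the URL is an example, and the exact response body isn’t guaranteed here):
```python
import requests

# Sketch: probe /health and treat a 503 as "alive but lagging", not as a reason to restart.
try:
    resp = requests.get("http://127.0.0.1:8108/health", timeout=3)
    if resp.status_code == 200:
        print("healthy")
    elif resp.status_code == 503:
        print("alive, but read/write lag threshold breached: do not restart")
    else:
        print(f"unexpected status: {resp.status_code}")
except requests.RequestException as exc:
    print(f"unreachable: {exc}")
```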
r
I can change it to a TCP healthcheck as you suggested. I’m not sure how I feel about the 503 on your end, though…
@Jason Bosco Can you take a look at these logs and help me understand what’s going on under Typesense’s hood? It was down for 3 minutes last night, correlating with specific log entries.
message has been deleted
j
I’ve usually seen this happen when CPU is fully exhausted, or in the case of AWS, when burst capacity runs out.
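Since a t4g.medium is a burstable instance, one way to rule that out is to pull its CPUCreditBalance from CloudWatch around the freeze window. A rough sketch (the instance ID is a placeholder):
```python
from datetime import datetime, timedelta

import boto3

# Sketch: fetch CPUCreditBalance for the burstable (t4g) instance over the last 24h.
# A balance near zero around the freeze would point at burst-capacity exhaustion.
cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder ID
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Minimum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```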
r
Any other ideas? CPU was < 5%
j
Not that I can think of… I wonder if there’s something unique about the ECS environment that could cause this. Could you try running on EC2 machines to see if you can replicate the issue there?