#community-help

Stability Issues with Self-Hosted Typesense Server on AWS ECS

TLDR Ryan reports stability issues with a self-hosted Typesense server on AWS ECS. Jason suggests several factors to check, including configuration, CPU allocation, the healthcheck, and testing on EC2 instances. No definitive solution is reached.

Aug 07, 2023
Ryan
03:50 PM
Has anyone experienced stability issues with their self-hosted Typesense server? We have fewer than 10,000 records in Typesense, deployed to an AWS ECS task. On average it uses 100 MB of memory, with a 400 MB limit. Usually once, perhaps twice, per day the server just freezes for one minute. Our clients receive a connection timeout when trying to reach the server. The logs are not helpful, but you can see where the regular cadence of the log entries (every 10 seconds) becomes inconsistent or goes silent for periods of time (8/6 14:36 GMT). Running v0.24.1.
Ryan
03:54 PM
I don’t believe it’s memory/CPU related. It stays very constant at around 21% / 100 MB.
Ryan
03:54 PM
[Image: 24 hours]
Ryan
03:55 PM
[Image attachment]
Jason
06:15 PM
Doesn’t ECS spin containers up and down on-demand? Could this be related to cold-start times?
Ryan
06:20 PM
Nope. It’s been running for 14 days.
Jason
06:21 PM
Ah hmm, I’m confusing ECS with Fargate
Jason
06:22 PM
Do you see any logs in that timeframe? Usually Typesense will log the current sequence number every 3-5s
Jason
06:22 PM
Or do the logs also freeze during this time?
Ryan
06:22 PM
see xlsx spreadsheet above
Ryan
06:23 PM
I can paste csv if needed
Jason
06:25 PM
I do see logs during this period
Jason
06:26 PM
So strange… I’ve never heard of such an issue or seen it on Typesense Cloud either. Is there another HTTP proxy in front of Typesense? Like an API gateway?
Ryan
06:27 PM
Look at the consistency of the prior logs: 00, 10, 20, 30, 40, 50, 00, …
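A rough way to spot the kind of break in log cadence described here, sketched in Python. It assumes the log timestamps have already been parsed into `datetime` objects; the function name, the 10-second expected interval (taken from the thread), and the 2-second tolerance are illustrative choices, not anything Typesense provides.

```python
from datetime import datetime, timedelta


def find_log_gaps(timestamps, expected=timedelta(seconds=10),
                  tolerance=timedelta(seconds=2)):
    """Return (previous, current, gap) for every pair of consecutive
    log entries that are further apart than expected + tolerance."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        delta = curr - prev
        if delta > expected + tolerance:
            gaps.append((prev, curr, delta))
    return gaps


# Synthetic timestamps for illustration only (not taken from the real logs):
# entries every 10 seconds, then a silent stretch of 55 seconds.
base = datetime(2023, 1, 1, 0, 0, 0)
sample = [base + timedelta(seconds=s) for s in (0, 10, 20, 75, 85)]
for prev, curr, delta in find_log_gaps(sample):
    print(f"gap of {delta.total_seconds():.0f}s between {prev} and {curr}")
```

Comparing the gaps this reports against the times clients saw connection timeouts would show whether the silent periods and the freezes line up.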
Ryan
06:27 PM
Typesense is running internal only. Not exposed externally. Our API uses python typesense client to connect to it.
Ryan
06:28 PM
[Image attachment]
Jason
06:29 PM
Do you have at least 2 vCPUs allocated to the container?
Ryan
06:29 PM
t4g.medium, no vCPU restrictions
Jason
06:31 PM
One way to narrow this down would be to try restarting the Typesense process to see if that fixes the issue… If it does, then it’s more likely a Typesense issue than not. If not, it could be coming from elsewhere in the stack
Jason
06:32 PM
Another option is to upgrade to 0.25.0.rc61; we’ve fixed some deadlock issues related to writes… (though that doesn’t seem to be the case here)
Ryan
06:33 PM
We’ve not been able to reproduce it ourselves, even with a load test script that hits it thousands of times with random queries.
Ryan
06:34 PM
We only write to it a few times per week. I’ve not thought about correlating this but I can check.
Ryan
06:36 PM
I’m using this modified container image in an attempt to have ECS restart the container. I don’t think ECS has had to perform a docker restart though. https://github.com/typesense/typesense/pull/1097/files
Jason
06:40 PM
Probably a side note: the /health endpoint will return unhealthy when, for example, the read/write lag threshold is breached. You don’t want to restart Typesense at that point, as it’s working through the backlog.
Jason
06:41 PM
A TCP healthcheck to ensure that there’s a process listening on the Typesense port is what you’d want to use in this particular context
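A minimal sketch of such a TCP-level check in Python, using only the standard library. The function name and the 2-second timeout are illustrative assumptions. The check only verifies that something accepts connections on the port; it says nothing about read or write lag.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established
    within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

For example, `tcp_port_open("localhost", 8108)` would probe Typesense's default API port, assuming the server runs on the same host with the default port. A container healthcheck command could wrap this and exit with status 0 or 1 depending on the result.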
Jason
06:42 PM
Anyway, I’m not too sure what else could be happening here…
Jason
06:42 PM
Could you maybe try removing this healthcheck and see if that helps?
Ryan
06:43 PM
The healthcheck is just checking for HTTP 2xx. Are you saying that this backlog state returns something other than an HTTP 200 status code?
Jason
06:43 PM
Correct, it will return a 503
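To make the distinction concrete, here is a hedged Python sketch of a probe that treats a 503 from `/health` differently from a server that cannot be reached at all. The function name and the status labels are hypothetical. Reading a 503 as "lagging" follows Jason's description above; a 503 could in principle have other causes.

```python
import urllib.error
import urllib.request


def classify_health(base_url: str, timeout: float = 2.0) -> str:
    """Probe <base_url>/health and return one of:
    'healthy'     - a 2xx response
    'lagging'     - a 503 (per the thread: lag threshold breached)
    'error'       - any other HTTP status
    'unreachable' - no HTTP response at all (refused, timed out, etc.)
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return "healthy" if 200 <= resp.status < 300 else "error"
    except urllib.error.HTTPError as exc:
        # HTTPError must be caught before URLError, which is its parent class.
        return "lagging" if exc.code == 503 else "error"
    except (urllib.error.URLError, OSError):
        return "unreachable"
```

Under this scheme only `'unreachable'` would be a reason to restart the container; `'lagging'` means the process is alive and working through its backlog.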
Ryan
06:44 PM
I can change it to a TCP healthcheck as you suggested. I’m not sure how I feel about the 503 on your end, though…
Aug 09, 2023
Ryan
02:17 PM
Jason, can you take a look at these logs and help me understand what’s going on under Typesense’s hood? It was down for 3 minutes last night, correlating with specific log entries.
Jason
03:49 PM
I’ve usually seen this happen when CPU is fully exhausted or, in the case of AWS, when burst capacity runs out
Ryan
07:58 PM
Any other ideas? CPU was < 5%
Jason
08:00 PM
Not that I can think of… I wonder if there’s something unique about the ECS environment that could cause this. Could you try running on EC2 machines to see if you can replicate the issue there?

