# community-help
c
Hi, any idea of the possible reason why there's quite a difference in the returned `search_time_ms` from 2 environments using exactly the same search parameters?
• Production env - we're looping over thousands of users and searching via Typesense, but some of the `search_time_ms` values return an average of 6000+ ms
• Local device - tried to replicate this while the for-loop in prod is running; the local app is pointed at the same Typesense collection with the same search parameters, and `search_time_ms` only returns less than 1000 ms
k
How many concurrent searches are happening on the production env?
c
@Kishore Nallan we have a Celery worker task that loops over roughly 5k users; for each user we're using the `multi_search` feature with at least 4 search parameters each. Then we have another web app where one endpoint searches the same collection; the number of requests depends on the logged-in users
since we implemented Typesense, these long queries have only started happening since last week
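For reference, a minimal sketch of what one iteration of such a Celery task might look like with the Python client — the hostnames, API key, collection names and filters below are purely hypothetical placeholders, not the actual production setup:

```python
import typesense

# Hypothetical client setup -- hosts, API key, collections and filters
# are placeholders, not the real production values.
client = typesense.Client({
    "api_key": "xyz",
    "nodes": [
        {"host": "node-1.example.net", "port": "443", "protocol": "https"},
        {"host": "node-2.example.net", "port": "443", "protocol": "https"},
        {"host": "node-3.example.net", "port": "443", "protocol": "https"},
    ],
    "connection_timeout_seconds": 2,
})

def search_for_user(user_id):
    # One multi_search call bundling ~4 searches per user, as described above.
    search_requests = {
        "searches": [
            {"collection": "items", "q": "*", "filter_by": f"owner_id:={user_id}"},
            {"collection": "items", "q": "*", "filter_by": f"shared_with:={user_id}"},
            {"collection": "orders", "q": "*", "filter_by": f"user_id:={user_id}"},
            {"collection": "messages", "q": "*", "filter_by": f"user_id:={user_id}"},
        ]
    }
    results = client.multi_search.perform(search_requests, {"per_page": 10})
    # search_time_ms is reported per individual search in the response.
    return [r.get("search_time_ms") for r in results["results"]]
```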
k
I wonder if these backend tasks are competing for CPU with the real-time searches happening on the same cluster.
c
@Kishore Nallan current CPU status from the Typesense dashboard
@Kishore Nallan is it possible that one node (the black one) is always getting all the load, even though we're using load balancing?
k
Ah yes. I suspect that the IP returned by the load-balanced DNS is being cached, so all workers are using the same underlying host. You can try configuring the Python client to use individual hosts by shuffling them.
c
@Kishore Nallan what do you mean by "the python client to use individual hosts by shuffling them"? This is our Typesense config; shouldn't this handle load balancing and avoid one node processing almost everything?
k
This will pick only the first host.
Only if that fails will another be used
are we missing the `nearest_node` key?
f
The nearest node is the one that will be used by default, if it exists
k
The nearest node will help if multiple instances of the client are used. If a single instance is used, the underlying resolution of the IP could still be cached.
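For context, `nearest_node` is a standard key in the client configuration dict. A hedged sketch with placeholder hosts and key:

```python
# Sketch of a config that includes the nearest_node key (placeholder hosts).
# The client prefers nearest_node while it is healthy and only falls back to
# the nodes list on failure -- so a single long-lived client instance will
# still keep sending its traffic to the same host.
config = {
    "api_key": "xyz",
    "nearest_node": {"host": "lb.example.net", "port": "443", "protocol": "https"},
    "nodes": [
        {"host": "node-1.example.net", "port": "443", "protocol": "https"},
        {"host": "node-2.example.net", "port": "443", "protocol": "https"},
        {"host": "node-3.example.net", "port": "443", "protocol": "https"},
    ],
    "connection_timeout_seconds": 2,
}
```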
c
@Fanis Tharropoulos yea, that's what I thought, we removed it before
@Kishore Nallan so, is there a Typesense config that handles this and prevents a certain node from processing everything? I mean one that also handles the cached IP, or do we need to handle it manually by using individual hosts and shuffling them, so the workers avoid all using the same node?
k
You have to use individual hosts and shuffle them for each client.
c
@Kishore Nallan sorry, just want to clarify: are you referring to something like this? A separate client config for each node?
k
No. Instead of always having the -1, -2, -3 order, shuffle this order so that each client instance uses a different order, because the first host is the one picked and used by the client.
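In other words, something along these lines — a minimal sketch, assuming each worker builds its own client at startup (hosts and API key are placeholders):

```python
import random
import typesense

NODES = [
    {"host": "node-1.example.net", "port": "443", "protocol": "https"},
    {"host": "node-2.example.net", "port": "443", "protocol": "https"},
    {"host": "node-3.example.net", "port": "443", "protocol": "https"},
]

def make_client():
    # Shuffle a copy of the node list so each client instance starts with a
    # different first host; the client sticks to that host while it is healthy.
    nodes = NODES[:]
    random.shuffle(nodes)
    return typesense.Client({
        "api_key": "xyz",
        "nodes": nodes,
        "connection_timeout_seconds": 2,
    })
```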
@Fanis Tharropoulos maybe we need to allow the client to round robin the hosts as an option.
c
@Kishore Nallan I see. The problem now is that we only instantiate 1 client, at the start of the app, so all Typesense queries use that one instantiated client
k
Let's see if we can add a round robin rotation feature to the python client.
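Until such an option exists in the client itself, one possible application-side workaround (a sketch only, not an official client feature) is to keep a small pool of clients, each pinned to a different first node, and rotate over them per request — again with placeholder hosts and key:

```python
import itertools
import threading
import typesense

NODES = [
    {"host": "node-1.example.net", "port": "443", "protocol": "https"},
    {"host": "node-2.example.net", "port": "443", "protocol": "https"},
    {"host": "node-3.example.net", "port": "443", "protocol": "https"},
]

# One client per node, each listing a different node first, so each client
# sticks to a different host while that host is healthy.
_clients = [
    typesense.Client({
        "api_key": "xyz",
        "nodes": NODES[i:] + NODES[:i],
        "connection_timeout_seconds": 2,
    })
    for i in range(len(NODES))
]
_cycle = itertools.cycle(_clients)
_lock = threading.Lock()

def get_client():
    # Round-robin across the pool; the lock keeps next() safe across threads.
    with _lock:
        return next(_cycle)
```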
c
@Kishore Nallan sorry, as I continued testing I was able to replicate the slow queries. We have 3 nodes; I tested every node, and the query is very slow on only 1 node. Does this confirm that that specific node might be overloaded during the query, while the other 2 nodes are not?
k
Yes, the query performance will depend on other searches that are happening on the node.
c
@Kishore Nallan sorry for all the questions. We just implemented a round robin across the 3 nodes, but we're still having slow queries of 6 seconds on average. From our logs it seems that only 1 node is slow (not yet 100% sure on this one). Would you be able to check on your side whether our burst allowance (2 vCPUs, 4 hr burst per day) is being used up and reset?
the slow queries only happen when we run a scheduled job that iterates over around 6k users and runs a search for each one. Their filters are almost the same, but only some queries are slow; we're thinking this happens when a query hits the slow node. Maybe you have a way on your side to check the status of the nodes (burst per day, or other things that might cause this)?
k
Please share your cluster ID
c
@Kishore Nallan sent cluster id via pm
k
Until about 10-15 mins ago, 1 of the 3 nodes had high latency and CPU usage. Is this what you are referring to?