# community-help
j
I've been experimenting with various search libs (sonic, meili, typesense, etc.). So far typesense is the easiest to get working with raw hex strings (20-byte blockchain addresses encoded as hex), but the query performance is slower than expected: I've indexed 50 million items as a test, and each one looks something like:
{"dbid": 1337, "address": "64a43130af34f9150030f2a2509a9efbd07fe372"}
Querying for "000000" returns 4 items in ~200ms (12 cores, 128 GB RAM, 4x2 TB RAID 0). 200ms is pretty decent, but not "amazing". An in-memory ART (adaptive radix trie, which I believe typesense also uses) can return this in a few ms. Does 200ms seem in line with your expectations?
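For context, a minimal sketch of how documents of that shape could have been bulk-loaded, assuming a JSONL file named addresses.jsonl and an API key exported as TYPESENSE_API_KEY (neither the filename nor the key appears in the thread):
# one {"dbid": ..., "address": "..."} object per line in addresses.jsonl
curl "http://localhost:8108/collections/addresses/documents/import?action=create" \
  -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  --data-binary @addresses.jsonl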
j
Could you share all the search query params you’re using?
And also the exact collection schema?
j
QUERY:
curl "<http://localhost:8108/collections/addresses/documents/search?q=000000&query_by=address>"
SCHEMA:
curl "<http://localhost:8108/collections>" \
  -X POST \
  -H "Content-Type: application/json" '{
    "name": "addresses",
    "fields": [
      {"name": "dbid", "type": "int64" },
      {"name": "address", "type": "string" }
    ],
    "default_sorting_field": "dbid"
  }'
(also running via the Docker image tagged 0.24.0.rcn28; I meant to try outside of Docker but haven't yet)
j
Could you try adding these additional search params:
num_typos=0 & typo_tokens_threshold=0 & drop_tokens_threshold=0 & prioritize_exact_match=false & highlight_fields=none
(space added for readability) and see if that makes a difference performance-wise
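For concreteness, the full search request with those params tacked on might look like this (the API key header is an assumption; it isn't shown anywhere in the thread):
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "http://localhost:8108/collections/addresses/documents/search?q=000000&query_by=address&num_typos=0&typo_tokens_threshold=0&drop_tokens_threshold=0&prioritize_exact_match=false&highlight_fields=none"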
I’ve anecdotally seen slower performance when running via Docker… but could you make sure that the Docker runtime is allowed to use all the cores and memory on the host machine?
If that also doesn’t help, could you check if running natively on the host makes a difference?
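One way to start the container with explicit CPU and memory limits so it can use the whole machine (a sketch only: the host data path and API key are placeholders, and the image tag is the one mentioned earlier in the thread):
# let the container use all 12 cores and most of the 128 GB of RAM
docker run -d --name typesense \
  --cpus="12" \
  --memory="120g" \
  -p 8108:8108 \
  -v /path/to/typesense-data:/data \
  typesense/typesense:0.24.0.rcn28 \
  --data-dir /data \
  --api-key="${TYPESENSE_API_KEY}"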
j
Adding those params did not change performance at all; testing the Docker angle now.
From 200ms (Docker) to 280ms (running directly on the host with typesense-server-0.23.1-linux-amd64.tar.gz). That's with a fresh data directory, a new index, and a restart after creating the index. Surprising result.
j
Hmmm! That was unexpected
@Kishore Nallan Any idea what’s happening here ^
Btw, could you post the output of GET /metrics.json?
j
{
  "system_cpu10_active_percentage": "0.00",
  "system_cpu11_active_percentage": "9.09",
  "system_cpu12_active_percentage": "0.00",
  "system_cpu13_active_percentage": "9.09",
  "system_cpu14_active_percentage": "0.00",
  "system_cpu15_active_percentage": "9.09",
  "system_cpu16_active_percentage": "0.00",
  "system_cpu17_active_percentage": "10.00",
  "system_cpu18_active_percentage": "0.00",
  "system_cpu19_active_percentage": "9.09",
  "system_cpu1_active_percentage": "27.27",
  "system_cpu20_active_percentage": "0.00",
  "system_cpu21_active_percentage": "0.00",
  "system_cpu22_active_percentage": "0.00",
  "system_cpu23_active_percentage": "0.00",
  "system_cpu24_active_percentage": "0.00",
  "system_cpu2_active_percentage": "25.00",
  "system_cpu3_active_percentage": "10.00",
  "system_cpu4_active_percentage": "10.00",
  "system_cpu5_active_percentage": "0.00",
  "system_cpu6_active_percentage": "9.09",
  "system_cpu7_active_percentage": "9.09",
  "system_cpu8_active_percentage": "9.09",
  "system_cpu9_active_percentage": "0.00",
  "system_cpu_active_percentage": "6.10",
  "system_disk_total_bytes": "7610737090560",
  "system_disk_used_bytes": "3837115981824",
  "system_memory_total_bytes": "134997864448",
  "system_memory_used_bytes": "71718522880",
  "system_network_received_bytes": "0",
  "system_network_sent_bytes": "0",
  "typesense_memory_active_bytes": "11111964672",
  "typesense_memory_allocated_bytes": "11072338904",
  "typesense_memory_fragmentation_ratio": "0.00",
  "typesense_memory_mapped_bytes": "11397263360",
  "typesense_memory_metadata_bytes": "226870128",
  "typesense_memory_resident_bytes": "11111964672",
  "typesense_memory_retained_bytes": "1533775872"
}
j
Is the 200ms specific to the query 000000? Could you try a random set of other strings to see if it’s consistent?
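A quick way to compare, sketched as a shell loop (the prefixes are arbitrary picks and the API key header is assumed):
# time a handful of arbitrary hex prefixes against the same collection;
# the JSON response also reports search_time_ms if you want the server-side number
for q in 000000 64a431 f2a250 9a9efb abcdef; do
  printf '%s: ' "$q"
  curl -s -o /dev/null -w '%{time_total}s\n' \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    "http://localhost:8108/collections/addresses/documents/search?q=${q}&query_by=address"
done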
j
very good q
oh wow, that may be it
wtf
all other queries are coming back in 0ms, lightning fast
j
😅
j
wow
beautiful
you made my day
thank you
k
There's a bunch of stuff we do with prefix searching that is not as straightforward as simply using an ART index directly. For example, we also sort words that match a prefix based on their frequency/popularity of occurrence. So certain popular prefixes could be a bit slower.
j
(this particular prefix, 000000, only had 4 matches in 50 million documents, so it may not be due to that, but I acknowledge your point) I'm surprised and impressed that raw hex strings worked so well with typesense. Most other search libraries couldn't handle it.