Clarification on RAFT and Trie in GitHub

TLDR Prabhat asked Kishore Nallan about Trie maintenance, durability, and sharding in GitHub. Kishore Nallan explained the in-memory storage of indexing data and provided a relevant source code link.

Mar 17, 2022
07:43 AM
Hi Kishore Nallan 👋
Attended your awesome talk today in GitHub. I asked a few questions related to RAFT, if you remember, but I have a bunch of others: Is the trie always maintained in memory? How do you ensure durability of the trie while indexing? Is sharding of tries also possible? Can you point me to a design doc I can read for more info, or to the relevant code folders where I can dig up the info myself?
Kishore Nallan
07:48 AM
👋 Glad you liked the talk.

1. All indexing data structures are stored in-memory, including the Trie. Here's the trie implementation, which is forked off a simpler library: https://github.com/typesense/typesense/blob/master/src/art.cpp
2. The trie is reconstructed on startup; only raw documents are stored on disk. This lets us modify or introduce new data structures without the baggage of migrating on-disk structures, which can be cumbersome. The downside is some "bootstrapping" time, as the indexes are built from scratch from the raw documents. But this is a trade-off chosen specifically for the kinds of use cases and datasets we've chosen to support.
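
To make the "raw documents on disk, index rebuilt in memory" idea concrete, here is a minimal C++ sketch. It is not Typesense's code: the real index is an adaptive radix tree (see the art.cpp link above), and `TrieNode`, `InMemoryIndex`, `RawDocumentStore`, and `rebuild_index` are hypothetical names invented for illustration. A `std::map` stands in for the on-disk document store, and tokenization is a plain whitespace split.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified trie node (one child per character).
// Assumption: a plain trie is used here only to keep the sketch short.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    std::vector<uint32_t> doc_ids;  // documents containing the token ending here
};

// Index that lives purely in memory and is never written to disk.
class InMemoryIndex {
public:
    // Split the text on whitespace and insert every token into the trie.
    void index_document(uint32_t doc_id, const std::string& text) {
        std::istringstream tokens(text);
        std::string token;
        while (tokens >> token) {
            TrieNode* node = &root_;
            for (char c : token) {
                std::unique_ptr<TrieNode>& child = node->children[c];
                if (!child) child = std::make_unique<TrieNode>();
                node = child.get();
            }
            // Skip the push if this document was just recorded for the token.
            if (node->doc_ids.empty() || node->doc_ids.back() != doc_id) {
                node->doc_ids.push_back(doc_id);
            }
        }
    }

    // Exact-token lookup: returns the ids of documents containing the token.
    std::vector<uint32_t> search(const std::string& token) const {
        const TrieNode* node = &root_;
        for (char c : token) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return {};
            node = it->second.get();
        }
        return node->doc_ids;
    }

private:
    TrieNode root_;
};

// Stand-in for the durable store: document id -> raw document text.
using RawDocumentStore = std::map<uint32_t, std::string>;

// On startup, rebuild the whole index from the raw documents.
// This loop is the "bootstrapping" cost described above.
InMemoryIndex rebuild_index(const RawDocumentStore& store) {
    InMemoryIndex index;
    for (const auto& entry : store) {
        index.index_document(entry.first, entry.second);
    }
    return index;
}
```

In this sketch, durability comes entirely from the raw documents: calling `rebuild_index(store)` after a restart yields the same searchable trie, and changing `TrieNode` requires no on-disk migration, only a rebuild.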