I am doing some benchmarking for Typesense by doin...
# community-help
e
I am doing some benchmarking for Typesense by doing some text extraction with Apache Tika. The test is basically to index ~100 books from Project Gutenberg and run some searches on some famous quotes, such as "human beings must love something". I'm starting with a single book, Jane Eyre. What's the optimal way to build a collection for this? Tika reads the epub and spits out the entire book as ~538 KB of text, so I'm just splitting that into sentences and storing the array of strings in a Typesense collection (schema below). As I start typing the first few characters, the search is noticeably slow to respond, and indeed takes several seconds on localhost with 8 cores and 32 GB of RAM. Instantsearch won't even load the collection on my remote server (4 cores available) because it times out. I'm not terribly surprised by this - I am searching for a needle in a haystack of words, after all - but is there a better way to configure Typesense to optimise a search like this? The search query is basically looking for "books that have these words in them", so there could be multiple books returned if there's a common keyword.
{
  "name": "Files",
  "fields": [
    {
      "name": "Title",
      "type": "string",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": true,
      "infix": false,
      "locale": "en",
      "stem": false
    },
    {
      "name": "Content",
      "type": "string[]",
      "facet": false,
      "optional": true,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "en",
      "stem": false
    }
  ],
  "default_sorting_field": "",
  "enable_nested_fields": true,
  "symbols_to_index": [],
  "token_separators": []
}
k
Storing an entire book in a single field as an array is not efficient. Instead, try indexing chunks of about 100 words as individual documents, each with a parent book id. Eventually you'll want to do a group by on the book id.
👌 1
The other option is to use joins to map book content to metadata that's common across chunks, like the title.
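A minimal sketch of the chunking idea above, in plain Python. The field names (`book_id`, `title`, `seq`, `content`) and the `chunk_book` helper are hypothetical, chosen for illustration; only the ~100-word chunk size and the parent book id come from the suggestion itself.

```python
from typing import Dict, Iterator


def chunk_book(book_id: str, title: str, text: str,
               words_per_chunk: int = 100) -> Iterator[Dict[str, object]]:
    """Yield one document per chunk of up to `words_per_chunk` words.

    Field names are illustrative assumptions, not a required schema.
    """
    words = text.split()  # naive whitespace tokenisation
    for seq, start in enumerate(range(0, len(words), words_per_chunk)):
        yield {
            "id": f"{book_id}-{seq}",   # unique per chunk
            "book_id": book_id,          # parent book, used later for grouping
            "title": title,
            "seq": seq,                  # chunk position within the book
            "content": " ".join(words[start:start + words_per_chunk]),
        }


# Toy input: 250 words should produce chunks of 100, 100 and 50 words.
docs = list(chunk_book("jane-eyre", "Jane Eyre", "word " * 250))
```

Splitting on whitespace can cut a quote across two chunks; overlapping the chunks by a few words, or chunking on sentence boundaries, would be one way around that, at the cost of a larger index.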
e
Yeah, was just thinking that, I do audiobooks in a similar way, line-by-line
At the moment the book is split into sentences... Jane Eyre is an array of 6526 strings, for example
with the audiobooks, I index the lines and facet on title/author
Could probably do the same here. Thanks for the tip!
👍 1
Is it fair to say that if I wasn't searching the contents of a book, but instead a smaller document like a 2-3 page PDF... it would be way more efficient?
k
The problem is that arrays are not meant for storing long text spread across many array elements. Storing everything as a single string field is better, but then you won't get multiple matches per document, which an array gives you, albeit inefficiently.
So the recommended approach is to store chunks as individual documents in single string fields. For smaller documents, an array is fine.
So maybe PDFs will fit that bucket.
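A rough sketch of what a chunk-per-document collection and a grouped search could look like, written as plain Python dicts. The collection name `BookChunks`, the field names, and the `build_search` helper are assumptions for illustration; `q`, `query_by`, `group_by` and `group_limit` are Typesense search parameters, and as far as I know a field has to be declared with `facet: true` before it can be used in `group_by`.

```python
# Hypothetical schema: one document per chunk, content in a single string field.
chunk_schema = {
    "name": "BookChunks",
    "fields": [
        {"name": "book_id", "type": "string", "facet": True},  # grouped on
        {"name": "title", "type": "string", "facet": True},
        {"name": "seq", "type": "int32"},
        {"name": "content", "type": "string"},
    ],
}


def build_search(query: str, hits_per_book: int = 3) -> dict:
    """Build search parameters that return the best chunks per book."""
    return {
        "q": query,
        "query_by": "content",
        "group_by": "book_id",         # one group per parent book
        "group_limit": hits_per_book,  # matching chunks shown per book
    }


params = build_search("human beings must love something")
```

These dicts would then be passed to whichever Typesense client is in use. Each group in the response corresponds to one book, which matches the "books that have these words in them" goal.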
e
Interesting! I tried the single string initially and found exactly what you describe. It's still slow, and although it does work, it can't highlight the match because the field is enormous 👍
That also explains why PDFs worked well in my initial tests. Most of the people looking at my projects are owners of govt websites with meeting agendas and things of that nature
Still, super useful talking point