I am doing some benchmarking for Typesense by doing some tex typesense #community-help


Elliot Sawyer

08/24/2024, 10:18 AM

I am doing some benchmarking for Typesense by doing some text extraction with Apache Tika. The test is basically to index ~100 books from Project Gutenberg and run some searches on some famous quotes, such as "human beings must love something". I'm starting with a single book, Jane Eyre. What's the optimal way to build a collection for this? Tika reads the epub and spits out the entire book as ~538kb of text, so I'm just splitting that into sentences and storing the array of strings in a Typesense collection (schema below). As I start typing the first few characters, the search is noticeably slow to respond, and indeed takes several seconds on a localhost with 8 cores and 32GB of RAM. Instantsearch won't even load the collection on my remote server because it times out (4 cores available). I'm not terribly surprised by this - I am searching for a needle in a haystack of words, after all - but is there a better way to configure Typesense to optimise a search like this? The search query is basically looking for "books that have these words in them", so there could be multiple books returned if there's a common keyword.


{
  "name": "Files",
  "fields": [
    {
      "name": "Title",
      "type": "string",
      "facet": false,
      "optional": false,
      "index": true,
      "sort": true,
      "infix": false,
      "locale": "en",
      "stem": false
    },
    {
      "name": "Content",
      "type": "string[]",
      "facet": false,
      "optional": true,
      "index": true,
      "sort": false,
      "infix": false,
      "locale": "en",
      "stem": false
    }
  ],
  "default_sorting_field": "",
  "enable_nested_fields": true,
  "symbols_to_index": [],
  "token_separators": []
}

Kishore Nallan

08/24/2024, 10:31 AM

Storing an entire book in a single field as an array is not efficient. Instead, try indexing chunks of 100 words as individual documents, each with a parent book id. Eventually you'll want to do a group by on book id.

👌 1
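A minimal sketch of the chunking idea above, in Python. The field names (`book_id`, `title`, `chunk_index`, `content`) and the helper names are hypothetical and not taken from the thread; the 100-word chunk size is the one suggested. The search parameters use Typesense's `group_by` and `group_limit` options, which, as far as I know, require the grouped field to be declared with `"facet": true` in the schema.

```python
from typing import Iterator


def chunk_words(text: str, size: int = 100) -> Iterator[str]:
    """Yield consecutive chunks of at most `size` whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), size):
        yield " ".join(words[start:start + size])


def build_chunk_documents(book_id: str, title: str, text: str, size: int = 100) -> list[dict]:
    """Turn one book into many small documents that share a parent book_id.

    All field names here are illustrative assumptions.
    """
    return [
        {
            "id": f"{book_id}-{n}",   # unique per chunk
            "book_id": book_id,       # parent id, used for grouping
            "title": title,
            "chunk_index": n,         # preserves reading order
            "content": chunk,         # a single string, not an array
        }
        for n, chunk in enumerate(chunk_words(text, size))
    ]


# Search parameters that collapse chunk hits back into one group per book.
# Assumes `book_id` is a faceted field in the collection schema.
search_params = {
    "q": "human beings must love something",
    "query_by": "content",
    "group_by": "book_id",
    "group_limit": 3,  # keep up to 3 matching chunks per book
}
```

The resulting list can be sent to the collection's bulk import endpoint; because each chunk is short, highlighting a match inside it stays cheap.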

Kishore Nallan

08/24/2024, 10:32 AM

The other option is to use joins to map book content with metadata like title that's common.
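A rough sketch of what the join variant might look like, written as Python dicts. The collection names (`books`, `book_chunks`) and field names are hypothetical. The `reference` field attribute and the `$collection(fields)` form of `include_fields` follow my understanding of Typesense's join support, so check them against the docs for the version you are running.

```python
# Metadata lives once per book.
books_schema = {
    "name": "books",
    "fields": [
        {"name": "title", "type": "string", "sort": True},
        {"name": "author", "type": "string", "facet": True},
    ],
}

# Each chunk points back to its book instead of repeating the title.
chunks_schema = {
    "name": "book_chunks",
    "fields": [
        # `reference` links this field to the `id` of a document in `books`.
        {"name": "book_id", "type": "string", "reference": "books.id"},
        {"name": "content", "type": "string"},
    ],
}

# Search the chunks, pulling the shared metadata in from the joined collection.
join_search_params = {
    "q": "human beings must love something",
    "query_by": "content",
    "include_fields": "$books(title, author)",
}
```

Compared with copying the title into every chunk, this keeps the chunk documents smaller and lets metadata be updated in one place.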

Elliot Sawyer

08/24/2024, 10:32 AM

Yeah, was just thinking that, I do audiobooks in a similar way, line-by-line

Elliot Sawyer

08/24/2024, 10:33 AM

At the moment the book is split into sentences... Jane Eyre is an array of 6526 strings, for example

Elliot Sawyer

08/24/2024, 10:33 AM

with the audiobooks, I index the lines and facet on title/author

Elliot Sawyer

08/24/2024, 10:34 AM

Could probably do the same here. Thanks for the tip!

👍 1

Elliot Sawyer

08/24/2024, 10:59 AM

Is it fair to say that if I wasn't searching the contents of a book, but instead a smaller document like a 2-3 page PDF... it would be way more efficient?

Kishore Nallan

08/24/2024, 11:03 AM

The problem is that arrays are not meant for storing long lengths of text across multiple array indices. Storing everything as a single string field is better, but that would mean you won't get multiple matches per document, which an array does get you, though in an inefficient way.

Kishore Nallan

08/24/2024, 11:04 AM

So the recommended approach is to store chunks as individual documents in single string fields. For smaller documents array is fine.

Kishore Nallan

08/24/2024, 11:04 AM

So maybe PDFs will fit that bucket.

Elliot Sawyer

08/24/2024, 11:06 AM

Interesting! I tried doing the single string initially and found exactly what you describe. Still slow but it does work, but it can't highlight the match because it's enormous 👍

Elliot Sawyer

08/24/2024, 11:06 AM

Also explains why PDFs worked well in my initial tests. Most of the people looking at my projects are owners of govt websites with meeting agendas and things of that nature

Elliot Sawyer

08/24/2024, 11:07 AM

Still, super useful talking point
