#community-help

Preparing Data for Indexing in Typesense

TLDR: Charles asked how to prepare diverse data for Typesense indexing. Jason suggested putting documents with the same fields in one collection and, for PDFs, extracting the text into paragraphs and creating one document per paragraph.

Jun 21, 2022
Charles
03:53 PM
Hello everyone, I am just starting with Typesense and was wondering how I should prepare my data for indexing. I have many documents with different layouts... How would you proceed? Thanks in advance.
Jason
03:56 PM
In general, you'd want to put all documents with the same fields and attribute types in one collection, similar to what you'd do in a relational database. A collection is roughly equivalent to a table.
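As a rough illustration of the grouping idea above, here is a minimal sketch that sorts mixed records into groups sharing the same field names and value types, so that each group could become one collection. The function name `group_by_shape` and the sample records are hypothetical, and the sketch does not call Typesense itself.

```python
from collections import defaultdict


def group_by_shape(records):
    """Group records by their (field name, value type) signature."""
    groups = defaultdict(list)
    for record in records:
        # The "shape" is the sorted set of field names paired with type names.
        shape = tuple(
            sorted((key, type(value).__name__) for key, value in record.items())
        )
        groups[shape].append(record)
    return dict(groups)


# Hypothetical sample data with two different layouts.
records = [
    {"title": "Dune", "pages": 612},
    {"name": "Ada", "email": "ada@example.com"},
    {"title": "Emma", "pages": 474},
]

groups = group_by_shape(records)
for shape, members in groups.items():
    print(shape, len(members))
```

Each resulting group maps naturally to one collection whose schema lists those fields and types.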
Charles
03:56 PM
Let's imagine I have books in PDF format. How would you proceed?
Jason
03:57 PM
You want to extract the text out of the PDF into paragraphs (there are libraries that do this for you), then create one document per paragraph in Typesense to get the most relevant results.
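The paragraph-per-document approach could look something like the sketch below. It assumes the text has already been extracted from the PDF by a library of your choice and that paragraphs are separated by blank lines, which is not true of every PDF. The names `split_paragraphs` and `to_documents` and the fields `book_title` and `text` are illustrative assumptions, not part of Typesense.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [paragraph for paragraph in paragraphs if paragraph]


def to_documents(book_title: str, text: str) -> list[dict]:
    """Build one document per paragraph (field names are hypothetical)."""
    return [
        {"id": f"{book_title}-{index}", "book_title": book_title, "text": paragraph}
        for index, paragraph in enumerate(split_paragraphs(text))
    ]


# Stand-in for text extracted from a PDF.
sample = "First paragraph,\nwrapped over two lines.\n\nSecond paragraph.\n\n\n"
documents = to_documents("example-book", sample)
```

The resulting list of dictionaries would then be sent to a collection whose schema declares the same fields.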
Charles
03:59 PM
Let's imagine this as an example:
[image attachment not preserved]
and many pages like this
Would you create blocks (heading + paragraphs) together?
Jason
04:00 PM
Yup
Charles
04:00 PM
Then, where we have several paragraphs under one heading, would you put the same heading on each of those paragraphs?
Jason
04:01 PM
If you need to show the paragraph heading next to each search result, yes.
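To make the repeated-heading idea concrete, here is a small sketch that flattens sections into one document per paragraph, copying the section heading onto each one. It assumes the headings and paragraphs have already been separated, which depends on how the PDF was parsed. The function name `documents_with_headings` and the fields `heading` and `text` are hypothetical.

```python
def documents_with_headings(sections):
    """Flatten (heading, paragraphs) pairs into one document per paragraph."""
    documents = []
    for heading, paragraphs in sections:
        for paragraph in paragraphs:
            documents.append(
                {
                    "id": str(len(documents)),  # simple running id, as a string
                    "heading": heading,  # repeated on every paragraph of the section
                    "text": paragraph,
                }
            )
    return documents


# Hypothetical parsed sections: one heading followed by its paragraphs.
sections = [
    ("Introduction", ["Why search matters.", "What this book covers."]),
    ("Setup", ["Install the server."]),
]

documents = documents_with_headings(sections)
```

Because each document carries its own heading, a search hit on any paragraph can display the heading without a second lookup, at the cost of storing the heading several times.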