#community-help

Creating Index for PDF Files Using Typesense

TLDR Greg asked for advice on using Typesense to index PDFs. Jason suggested extracting the text with Tika first, then optionally feeding the output to typesense-docsearch-scraper. After discussing the details, Greg decided to extract and parse the data with Tika and load it directly into Typesense.

Solved
Sep 26, 2022 (13 months ago)
Greg
04:45 PM
Looking to use Typesense to create an index of 4000+ PDF files. I was planning to use this approach, but with Typesense instead of Algolia: https://stories.algolia.com/indexing-pdf-or-other-file-contents-for-searching-b2499c23568f

Thoughts on an easier way to approach the problem? I don’t think the docsearch scraper will get me what I want but perhaps it will.

Thanks for your feedback!
Jason
05:41 PM
Greg, you’d definitely want to use something like Tika to extract the text out of the PDF files. It looks like Tika can output HTML, so if you’re able to output, say, one HTML file per chapter and serve those under a localhost domain as a static site, you can then use typesense-docsearch-scraper to create the actual records for you.

A couple of things to keep in mind with docsearch:

• The record structure created by docsearch is good, but it is opinionated and built to work with Docsearch.js.
• docsearch-scraper does not work with web servers running on non-standard port numbers (anything besides 80 and 443), due to an issue with the underlying Python scraping library.
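
As a rough illustration of the extraction step Jason describes, here is a minimal sketch that asks a Tika Server instance for HTML output. It assumes Tika Server is running locally on its default port (9998); the `TIKA_URL` constant and the helper names are made up for this example and are not from the thread.

```python
import urllib.request

# Assumption: a Tika Server instance is listening on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes, as_html: bool = True) -> urllib.request.Request:
    """Build a PUT request asking Tika Server to extract a PDF's content.

    The Accept header selects the output format: HTML or plain text.
    """
    accept = "text/html" if as_html else "text/plain"
    return urllib.request.Request(
        TIKA_URL,
        data=pdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/pdf", "Accept": accept},
    )


def extract(pdf_path: str, as_html: bool = True) -> str:
    """Send one PDF file to Tika Server and return the extracted content."""
    with open(pdf_path, "rb") as f:
        req = build_tika_request(f.read(), as_html)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Calling `extract("some.pdf")` would return the HTML for one file. Splitting that output into one page per chapter and serving it as a static site, as suggested above, is left out of the sketch.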
Greg
06:26 PM
That’s great info… but it feels like more work than it’s worth. I’m thinking we’ll just use Tika to extract and parse it, then pump it directly into Typesense. Here’s one of the actual files; we’re really only interested in cross-referencing the table data starting on page 5.

https://public.kimballincslc.com/vikingschematics/new/BQC300T.pdf
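
The direct route Greg describes (extract, parse, load) might look roughly like the sketch below. It assumes the table rows come out of Tika as plain-text lines shaped like `<item no> <part no> <description>`, which is a guess and not taken from the actual PDF. The collection name `schematic_parts`, the field names, the host, and the API key are likewise placeholders. The collection is assumed to exist already.

```python
import json
import re
import urllib.request

# Hypothetical row shape: "<item no> <part no> <description...>".
# The real table layout in the PDFs may differ and need its own pattern.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.+?)\s*$")


def rows_to_docs(text: str, source: str) -> list:
    """Turn extracted plain text into one document per matching table row."""
    docs = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip headings, blank lines, and anything not row-shaped
        item, part, desc = m.groups()
        docs.append({
            "id": f"{source}-{item}",  # stable id so re-imports upsert
            "source": source,
            "item": int(item),
            "part_number": part,
            "description": desc,
        })
    return docs


def to_jsonl(docs: list) -> str:
    """Serialize documents as JSON Lines, the format the import endpoint expects."""
    return "\n".join(json.dumps(d) for d in docs)


def import_docs(docs: list, collection: str = "schematic_parts",
                host: str = "http://localhost:8108", api_key: str = "xyz") -> str:
    """POST the documents to Typesense's bulk import endpoint as upserts."""
    req = urllib.request.Request(
        f"{host}/collections/{collection}/documents/import?action=upsert",
        data=to_jsonl(docs).encode("utf-8"),
        method="POST",
        headers={"X-TYPESENSE-API-KEY": api_key, "Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")  # one JSON result per line
```

Keeping the parsing separate from the upload makes it easy to check the row pattern against a few extracted pages before sending anything to the server.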
Jason
07:33 PM
Looks like it’s already semi-structured data, especially if it’s in a table. So yeah, docsearch-scraper might be overkill.
Jason
07:36 PM
There are also a couple of SaaS services that seem to do this, btw: extract parseable data from PDF files.
Greg
08:20 PM
Thankfully the PDF docs are pretty static so I think Tika will get us where we need to go.
Greg
08:20 PM
Appreciate the guidance and feedback! Glad we’re going the Typesense route!
