Creating Index for PDF Files Using Typesense
TLDR Greg asked for advice on using Typesense to index PDFs. Jason suggested extracting the text with Tika before indexing it in Typesense. After discussing the details, Greg decided to extract the text with Tika and parse the data directly into Typesense.
Sep 26, 2022
Thoughts on an easier way to approach the problem? I don’t think the docsearch scraper will get me what I want but perhaps it will.
Thanks for your feedback!
A couple of things to keep in mind with docsearch:
• The record structure created by docsearch is good, but is opinionated and built to work with Docsearch.js.
• docsearch-scraper does not work with web servers running on non-standard ports (anything other than 80 and 443), due to an issue in the underlying Python scraping library.
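For the route Greg settled on (extract text with Tika, then load it into Typesense), here is a rough sketch of what the pipeline could look like. It is not taken from the thread. It assumes a Tika Server on its default port 9998, a Typesense node on its default port 8108, and an existing collection whose schema matches the fields below. The collection name `pdf_chunks`, the API key, the field names, and the chunk size are all made up for illustration.

```python
import json
import urllib.request

# All of these values are assumptions for illustration, not from the thread.
TIKA_URL = "http://localhost:9998/tika"   # Tika Server, default port
TYPESENSE_URL = "http://localhost:8108"   # Typesense, default port
API_KEY = "xyz"                           # placeholder API key
COLLECTION = "pdf_chunks"                 # hypothetical, must already exist


def extract_text(pdf_path):
    """Send a PDF to Tika Server and return the extracted plain text."""
    with open(pdf_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        TIKA_URL,
        data=body,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def chunk_text(text, max_chars=1000):
    """Split text on whitespace into chunks of roughly max_chars.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks


def build_documents(filename, text, max_chars=1000):
    """Turn extracted text into one Typesense document per chunk."""
    return [
        {"id": f"{filename}-{i}", "filename": filename, "chunk": i, "content": c}
        for i, c in enumerate(chunk_text(text, max_chars))
    ]


def to_jsonl(documents):
    """Serialize documents as JSONL, the format the import endpoint expects."""
    return "\n".join(json.dumps(d) for d in documents)


def import_documents(documents):
    """Upsert documents through Typesense's bulk import endpoint."""
    url = f"{TYPESENSE_URL}/collections/{COLLECTION}/documents/import?action=upsert"
    req = urllib.request.Request(
        url,
        data=to_jsonl(documents).encode("utf-8"),
        method="POST",
        headers={"X-TYPESENSE-API-KEY": API_KEY, "Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # "manual.pdf" is a hypothetical input file.
    text = extract_text("manual.pdf")
    print(import_documents(build_documents("manual.pdf", text)))
```

Splitting the text into chunks keeps each search hit pointing at a small passage rather than a whole PDF. Whether to split by character count, page, or heading depends on the documents, so treat the chunking here as a placeholder.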
Solving Typesense Docsearch Scraper Issues
Sandeep was having issues with Typesense's docsearch scraper, which returned fewer results than Algolia's scraper. Jason helped by sharing the query they use and advised checking which version of the scraper was running. The issue was resolved when Sandeep ran the regular (non-base) Docker image.
Trouble with DocSearch Scraper and Pipenv Across Multiple OSs
James ran into errors when trying to build Typesense DocSearch Scraper from scratch, and believes the cause is a bad Pipfile.lock. Jason tried to replicate the error and spent hours isolating the issue, but ultimately fixed the problem and copied his bash history for future reference. The conversation touches briefly on using a virtual machine for testing.
Docsearch Scraper Metadata Configuration and Filter Problem
Marcos faced issues with the Docsearch scraper not adding metadata attributes and filtering out documents without content. Jason helped fix the issue by updating the scraper and providing filtering instructions.