Optimizing a Podcast Feed Dataset for a Searchable Database
TLDR Alexander seeks advice on optimizing a podcast database for search. Kishore Nallan notes that data size and stopwords affect RAM usage and that benchmarking on 1M records would be useful. satish raises the potential need for vector search. Both recommend feeding user activity data into ML models for relevancy ranking, and collaboration was suggested.
May 02, 2022 (18 months ago)
Alexander
01:30 PM
Here are some of my ideas:
• Does field name size make a difference?
• Does size variability within the same field make a difference?
• Can I reduce the size by throwing out stopwords and fillers from descriptions without losing search quality?
I think this could be interesting in general. If this hasn't been benchmarked much, I could help run some tests.
Kishore Nallan
01:33 PM
By removing stop words you can certainly save memory.
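A minimal sketch of the stop-word idea being discussed here: strip a small stop-word list from descriptions before indexing and compare the raw text size. The tiny stop-word set and the sample descriptions are illustrative assumptions, and raw UTF-8 byte length is only a loose proxy for the RAM the search index actually uses.

```python
# Minimal sketch (not Typesense-specific): strip stop words from episode
# descriptions before indexing and report the rough size reduction.
# The stop-word list is a tiny illustrative subset (a real list, e.g. NLTK's,
# is much larger), and raw UTF-8 byte length is only a loose proxy for the
# RAM the search index itself needs.
import re

STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "we", "with",
}

def strip_stopwords(text: str) -> str:
    """Drop stop words (and punctuation) while keeping the word order."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return " ".join(w for w in words if w.lower() not in STOPWORDS)

if __name__ == "__main__":
    descriptions = [
        "A look at the history of coffee and how it is roasted around the world.",
        "In this episode we talk to a guest about the future of open source search.",
    ]
    before = sum(len(d.encode("utf-8")) for d in descriptions)
    after = sum(len(strip_stopwords(d).encode("utf-8")) for d in descriptions)
    print(f"{before} bytes -> {after} bytes "
          f"({100 * (before - after) / before:.0f}% smaller)")
```

Whether the trimmed text still reads acceptably in result snippets is the search-quality trade-off raised in the question above.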
Alexander
01:34 PM
Kishore Nallan
01:35 PM
satish
01:37 PM
Alexander
01:37 PM
Kishore Nallan
01:37 PM
satish
01:38 PM
Alexander
01:38 PM
satish
01:39 PM
Kishore Nallan
01:40 PM
Kishore Nallan
01:40 PM
Alexander
01:41 PM
satish
01:42 PM
Kishore Nallan
01:44 PM
satish
01:45 PM
Alexander
01:45 PM
Kishore Nallan
01:46 PM
> Although Dense Retrieval / Natural Language Search has very interesting properties, it often fails to perform as well as traditional IR methods on exact term matching (and is also more expensive to run on all queries). That’s why we decided to make our Natural Language Search an additional source rather than just replace our other retrieval sources (including our Elasticsearch cluster).
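The quoted passage argues for keeping traditional keyword retrieval and adding a semantic source alongside it rather than replacing it. Below is a minimal sketch of one common way to blend the two ranked lists, reciprocal rank fusion; the episode IDs, the two hit lists, and the constant k=60 are illustrative assumptions, not anything Typesense-specific.

```python
# Minimal sketch of the "additional source" idea from the quote above: keep a
# traditional keyword search as one retrieval source and blend in a semantic /
# vector source via reciprocal rank fusion instead of replacing it. The episode
# IDs, the two hit lists, and k=60 are illustrative assumptions only.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; documents ranked well by any source rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    keyword_hits = ["ep-102", "ep-007", "ep-055"]   # e.g. exact term matches
    semantic_hits = ["ep-007", "ep-310", "ep-102"]  # e.g. nearest-neighbour vectors
    print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
    # ep-007 and ep-102 appear in both lists, so they end up ranked first.
```

Rank fusion is only one way to combine sources; a re-ranking step fed by user activity data, as suggested elsewhere in the thread, could then sit on top of the merged list.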
Alexander
01:47 PM
Alexander
01:49 PM
Kishore Nallan
01:51 PM
Kishore Nallan
01:52 PM
Alexander
01:52 PM
Kishore Nallan
01:54 PM
satish
01:58 PM
Alexander
01:59 PM
Similar Threads
Integrating Semantic Search with Typesense
Krish wants to integrate a semantic search functionality with typesense but struggles with the limitations. Kishore Nallan provides resources, clarifications and workarounds to the raised issues.
Utilizing Vector Search and Word Embeddings for Comprehensive Search in Typesense
Bill sought clarification on using vector search with multiple word embeddings in Typesense and using them instead of OpenAI's embedding. Kishore Nallan and Jason informed him that their development version 0.25 supports open source embedding models. They also resolved Bill's concerns regarding search performance, language support, and limitations in the search parameters.
Understanding Vector Search with Typesense
In a chat with em1nos and Andrew, Kishore Nallan explained how Vector Search works. He clarified that it can be useful for recommendations and personalization, but it requires machine learning to convert data into vectors before searching.
Improving Search Relevance with Typesense
Viktor asks how Typesense calculates relevance and Jason suggests using vector search, specifically S-BERT embeddings, to better match low information queries to relevant documents.
Adjusting Text Match Score Calculation in Typesense
Johannes wanted to modify the Text Match Score calculation in Typesense to improve the returned search results. With counsel from Jason and Kishore Nallan, various solutions were proposed, including creating a GitHub issue, trying different parameters, and updating Docker to a new version to resolve the matter.