#community-help

Optimizing Dataset of Podcast Feeds for a Searchable Database

TLDR Alexander seeks advice on optimizing a podcast database for search. Kishore Nallan explains that the inverted index drives most of the RAM usage, that removing stopwords saves memory, and that benchmarking on 1M records first would be useful. satish raises the potential need for vector search. Kishore Nallan recommends ranking by a popularity score at first and later feeding real search and click data into an ML model for relevancy ranking. Collaboration was suggested.

May 02, 2022 (18 months ago)
Alexander
01:30 PM
Hi! I have formed a dataset of 120GB of podcast feeds, 2M podcasts, ~50M episodes and am trying to make them searchable. For anyone who listens to podcasts, the search experience in this space is usually very bad and slow, while it needs to be very high quality to be able to discover new content. Now I am trying to find out what exactly uses the most resources in Typesense to optimize this. What factors are mostly responsible for RAM usage?
Here are some of my ideas:
• Does field name size make a difference?
• Does size variability within the same field make a difference?
• Can I reduce the size by throwing out stopwords and fillers from descriptions without losing search quality?
I think this could be something interesting in general. If this stuff is not that benchmarked I could help run some tests.
Kishore Nallan
01:33 PM
👋 Field name won't make any difference. Size variability also won't matter. The main memory consumption would stem from the actual text index where we need to store the document and positions where each word appears in the corpus (called the inverted index).

By removing stop words you can certainly save memory.
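
To illustrate the point about the inverted index, here is a minimal sketch of a positional inverted index and of how dropping stopwords shrinks it. It is not Typesense's actual data structure; the `STOPWORDS` set, the function names, and the sample documents are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is"})


def build_inverted_index(docs, stopwords=frozenset()):
    """Map each token to {doc_id: [positions]}, skipping any stopwords."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            if token in stopwords:
                continue
            index[token].setdefault(doc_id, []).append(position)
    return index


def count_positions(index):
    """Total number of stored (token, document, position) entries."""
    return sum(
        len(positions)
        for postings in index.values()
        for positions in postings.values()
    )


# Made-up sample documents.
docs = {
    1: "the history of the internet",
    2: "a podcast about the history of jazz",
}

full = build_inverted_index(docs)
trimmed = build_inverted_index(docs, STOPWORDS)
print(count_positions(full), count_positions(trimmed))
```

In this toy corpus half of the stored positions belong to stopwords. How much a real corpus saves depends on its text, and queries in which a common word carries meaning can suffer, so the effect on search quality is worth checking on a sample.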
Alexander
01:34 PM
Thank you for your response. Is that something worth benchmarking? From the speed of your response I guess you know this stuff fully already 🙂
Kishore Nallan
01:35 PM
Yes ofc I write Typesense 😄
satish
01:37 PM
Does this use case require vector search? I have seen a Spotify blog post that talks about solving this with it: https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/
Alexander
01:37 PM
Sure, I was just curious if that's an opportunity to get involved
Kishore Nallan
01:37 PM
My suggestion is to benchmark 1M records first and then use that to develop a sense of the memory usage.
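
One rough way to use such a benchmark is to scale the measured memory linearly to the full dataset. The sketch below assumes memory grows roughly in proportion to record count, which may not hold exactly; the function name is hypothetical and the 4.0 GB figure is a made-up placeholder, not a measurement.

```python
def estimate_memory_gb(sample_records, sample_memory_gb, total_records):
    """Linearly scale memory measured on a sample to the full dataset."""
    if sample_records <= 0:
        raise ValueError("sample_records must be positive")
    return sample_memory_gb * total_records / sample_records


# Placeholder input: 4.0 GB for 1M records is invented for illustration.
estimate = estimate_memory_gb(1_000_000, 4.0, 50_000_000)
print(f"Estimated memory for 50M records: {estimate:.1f} GB")
```

Benchmarking at two or three sample sizes would show whether the linear assumption is reasonable before relying on the extrapolation.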
satish
01:38 PM
Here is a step-by-step tutorial for doing the same: https://www.youtube.com/watch?v=ok0SDdXdat8
Alexander
01:38 PM
satish That looks very interesting
satish
01:39 PM
My understanding is that Typesense doesn't support vector search yet, but Jason said that's an area of interest.
Kishore Nallan
01:40 PM
With vector search you will probably get really good semantic search results, but it will not support type ahead / instant searching. So it depends on what kind of UX you are looking for.
Kishore Nallan
01:40 PM
Also typo correction will have to be done outside as well.
Alexander
01:41 PM
Maybe two different searches can be used for each purpose
satish
01:42 PM
Kishore Nallan Yes, you are right. But podcast search generally involves a topic name, and we want to get all the episodes/podcasts related to it. Again, speaking as a user, I want that 🙂
Kishore Nallan
01:44 PM
Yup yup, instant search is certainly not great for everything.
satish
01:45 PM
So that said, when can we get vector search in Typesense? 🙂
Alexander
01:45 PM
Kishore Nallan I am planning to use it though. Type ahead is great! I want it to feel snappy
Kishore Nallan
01:46 PM
Relevant part from that article above:

> Although Dense Retrieval / Natural Language Search has very interesting properties, it often fails to perform as well as traditional IR methods on exact term matching (and is also more expensive to run on all queries). That’s why we decided to make our Natural Language Search an additional source rather than just replace our other retrieval sources (including our Elasticsearch cluster).
Alexander
01:47 PM
That is actually very relevant
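
As a rough sketch of treating semantic search as an additional source next to keyword search, the snippet below keeps keyword hits first and appends semantic hits that are not already present. This merge policy is an assumption for illustration, not Spotify's method; `merge_sources` and the episode IDs are hypothetical.

```python
def merge_sources(keyword_ids, semantic_ids, limit=10):
    """Keep keyword hits first, then append semantic hits not already present."""
    merged = []
    seen = set()
    for doc_id in list(keyword_ids) + list(semantic_ids):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged[:limit]


# Made-up result lists from the two hypothetical sources.
merged = merge_sources(["ep1", "ep2"], ["ep2", "ep7", "ep9"], limit=4)
print(merged)
```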
Alexander
01:49 PM
On another question: How would I go about a ranking system, e.g. one based on likes? If I have a search term and the number of likes or other factors, it is my understanding that I have to tweak the weights until I feel the results are correct. Is there a more scientific method to this than just random parameter optimization?
Kishore Nallan
01:51 PM
That's a very tricky problem. You need either enough search volume to identify popular records or find a proxy for the podcast popularity: maybe your primary source has number of stars / rating?
Kishore Nallan
01:52 PM
One reason Google is really good is because they have a truckload of real user searches. Even in the Spotify example above, they use actual (query, record_clicked) logs to train the ML model. But it's a chicken and egg problem when you are bootstrapping the search system initially, so you need to try and find the popularity score from the primary data source somehow.
Alexander
01:52 PM
Sure, I can calculate some value of popularity, maybe even from the primary source. But I still have to mix it in with relevancy, don't I?
Kishore Nallan
01:54 PM
Given two documents that both match all query tokens, the simplest start is to rank the records by the popularity score. Once you have enough real-world search and click data, you can eventually build an ML model for predicting this more broadly.
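
The "simplest start" can be sketched as a two-level sort: first by how many query tokens a record matched, then by popularity as the tie-breaker. The `Hit` class, its fields, and the sample values are hypothetical and only illustrate the ordering; this is not how Typesense computes text match scores.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    matched_tokens: int  # how many query tokens the record matched
    popularity: float    # hypothetical score from the primary data source


def rank(hits):
    """Order by tokens matched first, then by popularity as the tie-breaker."""
    return sorted(hits, key=lambda h: (-h.matched_tokens, -h.popularity))


# Made-up hits for the query "jazz history".
hits = [
    Hit("Jazz History Weekly", matched_tokens=2, popularity=10.0),
    Hit("The History of Jazz", matched_tokens=2, popularity=250.0),
    Hit("Jazz Today", matched_tokens=1, popularity=900.0),
]
ranked_titles = [h.title for h in rank(hits)]
print(ranked_titles)
```

Popularity only reorders records that match equally well here, so a very popular partial match ("Jazz Today") still ranks below the full matches.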
satish
01:58 PM
Alexander Is this data publicly available, so we can work together on solving this problem?
Alexander
01:59 PM
satish I'll write you a pm