# community-help
a
Hi! I have formed a dataset of 120GB of podcast feeds, 2M podcasts, ~50M episodes and am trying to make them searchable. For anyone who listens to podcasts, the search experience in this space is usually very bad and slow, while it needs to be very high quality to be able to discover new content. Now I am trying to find out what exactly uses the most resources in Typesense so I can optimize it. Which factors are mostly responsible for RAM usage? Here are some of my ideas:
• Does field name size make a difference?
• Does size variability within the same field make a difference?
• Can I reduce the size by throwing out stopwords and fillers from descriptions without losing search quality?
I think this could be something interesting in general. If this stuff is not well benchmarked, I could help run some tests.
k
👋 Field name length won't make any difference. Size variability also won't matter. The main memory consumption comes from the actual text index, where we need to store each document and the positions where each word appears in the corpus (the inverted index). By removing stop words you can certainly save memory.
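(Not from the Typesense internals, just a toy Python sketch of a positional inverted index, to show why memory scales with the text volume and unique words rather than with field names, and why dropping stopwords shrinks it. The stopword list and documents are made up.)

```python
from collections import defaultdict

# Illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def build_index(docs, drop_stopwords=False):
    # Toy positional inverted index: word -> {doc_id: [positions]}.
    # Typesense's real index is far more compact, but the shape shows
    # that memory grows with the number of postings, not field names.
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if drop_stopwords and word in STOPWORDS:
                continue  # fewer postings -> less memory
            index[word][doc_id].append(pos)
    return index

docs = {
    1: "the history of the internet",
    2: "a podcast about the history of jazz",
}
full = build_index(docs)
trimmed = build_index(docs, drop_stopwords=True)
print(len(full), "terms vs", len(trimmed), "terms without stopwords")
```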
a
Thank you for your response. Is that something worth benchmarking? From the speed of your response I guess you already know this stuff well 🙂
k
Yes ofc I write Typesense 😄
s
Does this use case require vector search? I have seen a Spotify blog post which talks about solving this with the same approach: https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/
a
Sure, I was just curious if that's an opportunity to get involved
k
My suggestion is to benchmark 1M records first and then use that to develop a sense of the memory usage.
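(One possible way to run that 1M-record benchmark, as a hedged sketch: the collection name, schema fields, API key, and file path are made up, while the collections, bulk import, and /metrics.json endpoints are standard Typesense APIs; exact metric key names may vary by version.)

```python
import requests

TYPESENSE = "http://localhost:8108"
HEADERS = {"X-TYPESENSE-API-KEY": "xyz"}  # replace with your admin key

# Hypothetical schema for a 1M-episode sample; field names are illustrative.
schema = {
    "name": "episodes_sample",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "description", "type": "string"},
        {"name": "podcast_title", "type": "string"},
    ],
}
requests.post(f"{TYPESENSE}/collections", json=schema, headers=HEADERS)

# Bulk import ~1M episodes from a JSONL file (one document per line).
with open("episodes_1m.jsonl", "rb") as f:
    requests.post(
        f"{TYPESENSE}/collections/episodes_sample/documents/import",
        data=f,
        headers=HEADERS,
    )

# Compare memory before and after the import to estimate per-record cost.
metrics = requests.get(f"{TYPESENSE}/metrics.json", headers=HEADERS).json()
print(metrics.get("typesense_memory_active_bytes"))
```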
s
Here is a step-by-step tutorial for doing exactly that:

https://www.youtube.com/watch?v=ok0SDdXdat8

a
@satish venkatakrishnan That looks very interesting
s
My understanding is Typesense doesn't support vector search yet. But Jason said that's an area of interest.
k
With vector search you will probably get really good semantic search results, but it will not support type-ahead / instant searching. So it depends on what kind of UX you are looking for.
Typo correction will also have to be handled outside.
a
Maybe two different searches can be used, one for each purpose.
s
@Kishore Nallan Yes, you are right. But podcast search generally involves a topic name, and we want to get all the episodes/podcasts related to it. Again, as a user, that's what I want 🙂
k
Yup yup, instant search is certainly not great for everything.
s
That said, when can we get vector search in Typesense? 🙂
a
@Kishore Nallan I am planning to use it though. Type-ahead is great! I want it to feel snappy.
k
Relevant part from that article above:
Although Dense Retrieval / Natural Language Search has very interesting properties, it often fails to perform as well as traditional IR methods on exact term matching (and is also more expensive to run on all queries). That’s why we decided to make our Natural Language Search an additional source rather than just replace our other retrieval sources (including our Elasticsearch cluster).
a
That is actually very relevant
Another question: how would I go about a ranking system based on, e.g., likes? If I have a search term plus the number of likes or other factors, my understanding is that I have to tweak the weights until the results feel right. Is there a more scientific method than just random parameter tuning?
k
That's a very tricky problem. You either need enough search volume to identify popular records, or you have to find a proxy for podcast popularity: maybe your primary source has a star count / rating?
One reason Google is really good is that they have a truckload of real user searches. Even in the Spotify example above, they use actual (query, record_clicked) logs to train the ML model. But it's a chicken-and-egg problem when you are bootstrapping a search system, so initially you need to derive the popularity score from the primary data source somehow.
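(Just to make the (query, record_clicked) idea concrete, a toy sketch where the log format and episode IDs are invented: aggregating clicks into a per-record popularity count that could later be stored as a numeric field.)

```python
from collections import Counter

# Hypothetical click log: (search_query, clicked_episode_id) pairs.
click_log = [
    ("history of jazz", "ep_42"),
    ("history of jazz", "ep_42"),
    ("internet history", "ep_7"),
]

# Crude popularity signal: how often each record was clicked from results.
popularity = Counter(episode_id for _, episode_id in click_log)
print(popularity.most_common(3))  # e.g. [('ep_42', 2), ('ep_7', 1)]
```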
a
Sure, I can calculate some popularity value, maybe even from the primary source. But I still have to mix it in with relevancy, don't I?
k
Given two documents that both match all query tokens, the simplest start is to rank the records by the popularity score. Once you have enough real-world search and click data, you can eventually build an ML model to predict this more broadly.
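(In Typesense terms, that tie-breaking can be expressed with the sort_by search parameter, ranking by text match first and then by a numeric popularity field. A sketch only: the collection and field names follow the hypothetical schema above, and the popularity field would need to be added to the schema as a numeric type.)

```python
import requests

TYPESENSE = "http://localhost:8108"
HEADERS = {"X-TYPESENSE-API-KEY": "xyz"}

# Documents that match the query equally well get ordered by popularity.
# Assumes the collection has a numeric "popularity" field (e.g. int32).
params = {
    "q": "history of jazz",
    "query_by": "title,description",
    "sort_by": "_text_match:desc,popularity:desc",
}
results = requests.get(
    f"{TYPESENSE}/collections/episodes_sample/documents/search",
    params=params,
    headers=HEADERS,
).json()
for hit in results.get("hits", []):
    print(hit["document"]["title"])
```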
s
@Alexander Zierhut Is this data available publicly, so we can work together on solving this problem?
a
@satish venkatakrishnan I'll send you a PM.