# community-help
a
Hi! I have formed a dataset of 120GB of podcast feeds, 2M podcasts, ~50M episodes and am trying to make them searchable. For anyone who listens to podcasts, the search experience in this space is usually very bad and slow, while it needs to be very high quality to be able to discover new content. Now I am trying to find out what exactly uses the most resources in Typesense so I can optimize it. Which factors are mostly responsible for RAM usage? Here are some of my ideas:
• Does field name size make a difference?
• Does size variability within the same field make a difference?
• Can I reduce the size by throwing out stopwords and fillers from descriptions without losing search quality?
I think this could be something interesting in general. If this stuff is not well benchmarked, I could help run some tests.
k
👋 Field name length won't make any difference. Size variability also won't matter. The main memory consumption comes from the actual text index, where we need to store each document and the positions where each word appears in the corpus (the inverted index). By removing stop words you can certainly save memory.
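(Not from the Typesense internals, just a toy Python sketch of a positional inverted index, to show why memory scales with the text volume and unique words rather than with field names, and why dropping stopwords shrinks it. The stopword list and documents are made up.)

```python
from collections import defaultdict

# Illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def build_index(docs, drop_stopwords=False):
    # Toy positional inverted index: word -> {doc_id: [positions]}.
    # Typesense's real index is far more compact, but the shape shows
    # that memory grows with the number of postings, not field names.
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if drop_stopwords and word in STOPWORDS:
                continue  # fewer postings -> less memory
            index[word][doc_id].append(pos)
    return index

docs = {
    1: "the history of the internet",
    2: "a podcast about the history of jazz",
}
full = build_index(docs)
trimmed = build_index(docs, drop_stopwords=True)
print(len(full), "terms vs", len(trimmed), "terms without stopwords")
```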
a
Thank you for your response. Is that something worth benchmarking? From the speed of your response I guess you already know this stuff well 🙂
k
Yes ofc I write Typesense 😄
s
Does this use case require vector search? I have seen a Spotify blog post which talks about solving this with the same approach: https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/
a
Sure, I was just curious if that's an opportunity to get involved
k
My suggestion is to benchmark 1M records first and then use that to develop a sense of the memory usage.
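(One possible way to run that 1M-record benchmark, as a hedged sketch: the collection name, schema fields, API key, and file path are made up, while the collections, bulk import, and /metrics.json endpoints are standard Typesense APIs; exact metric key names may vary by version.)

```python
import requests

TYPESENSE = "http://localhost:8108"
HEADERS = {"X-TYPESENSE-API-KEY": "xyz"}  # replace with your admin key

# Hypothetical schema for a 1M-episode sample; field names are illustrative.
schema = {
    "name": "episodes_sample",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "description", "type": "string"},
        {"name": "podcast_title", "type": "string"},
    ],
}
requests.post(f"{TYPESENSE}/collections", json=schema, headers=HEADERS)

# Bulk import ~1M episodes from a JSONL file (one document per line).
with open("episodes_1m.jsonl", "rb") as f:
    requests.post(
        f"{TYPESENSE}/collections/episodes_sample/documents/import",
        data=f,
        headers=HEADERS,
    )

# Compare memory before and after the import to estimate per-record cost.
metrics = requests.get(f"{TYPESENSE}/metrics.json", headers=HEADERS).json()
print(metrics.get("typesense_memory_active_bytes"))
```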
s
Here is a step-by-step tutorial for doing exactly that:

https://www.youtube.com/watch?v=ok0SDdXdat8

a
@satish venkatakrishnan That looks very interesting
s
My understanding is Typesense doesn't support vector search yet. But Jason said that's an area of interest.
k
With vector search you will probably get really good semantic search results, but it will not support type-ahead / instant searching. So it depends on what kind of UX you are looking for.
Typo correction will also have to be handled outside.
a
Maybe two different searches can be used, one for each purpose.
s
@Kishore Nallan Yes, you are right. But podcast search generally involves a topic name, and we want to get all the episodes/podcasts related to it. Again, as a user, that's what I want 🙂
k
Yup yup, instant search is certainly not great for everything.
s
That said, when can we get vector search in Typesense? 🙂
a
@Kishore Nallan I am planning to use it though. Type-ahead is great! I want it to feel snappy.
k
Relevant part from that article above:
Although Dense Retrieval / Natural Language Search has very interesting properties, it often fails to perform as well as traditional IR methods on exact term matching (and is also more expensive to run on all queries). That’s why we decided to make our Natural Language Search an additional source rather than just replace our other retrieval sources (including our Elasticsearch cluster).
a
That is actually very relevant
Another question: how would I go about a ranking system based on, e.g., likes? If I have a search term plus the number of likes or other factors, my understanding is that I have to tweak the weights until the results feel right. Is there a more scientific method than just random parameter tuning?
k
That's a very tricky problem. You either need enough search volume to identify popular records, or you have to find a proxy for podcast popularity: maybe your primary source has a star count / rating?
One reason Google is really good is that they have a truckload of real user searches. Even in the Spotify example above, they use actual (query, record_clicked) logs to train the ML model. But it's a chicken-and-egg problem when you are bootstrapping a search system, so initially you need to derive the popularity score from the primary data source somehow.
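(Just to make the (query, record_clicked) idea concrete, a toy sketch where the log format and episode IDs are invented: aggregating clicks into a per-record popularity count that could later be stored as a numeric field.)

```python
from collections import Counter

# Hypothetical click log: (search_query, clicked_episode_id) pairs.
click_log = [
    ("history of jazz", "ep_42"),
    ("history of jazz", "ep_42"),
    ("internet history", "ep_7"),
]

# Crude popularity signal: how often each record was clicked from results.
popularity = Counter(episode_id for _, episode_id in click_log)
print(popularity.most_common(3))  # e.g. [('ep_42', 2), ('ep_7', 1)]
```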
a
Sure, I can calculate some popularity value, maybe even from the primary source. But I still have to mix it in with relevancy, don't I?
k
Given two documents that both match all query tokens, the simplest start is to rank the records by the popularity score. Once you have enough real-world search and click data, you can eventually build an ML model to predict this more broadly.
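(In Typesense terms, that tie-breaking can be expressed with the sort_by search parameter, ranking by text match first and then by a numeric popularity field. A sketch only: the collection and field names follow the hypothetical schema above, and the popularity field would need to be added to the schema as a numeric type.)

```python
import requests

TYPESENSE = "http://localhost:8108"
HEADERS = {"X-TYPESENSE-API-KEY": "xyz"}

# Documents that match the query equally well get ordered by popularity.
# Assumes the collection has a numeric "popularity" field (e.g. int32).
params = {
    "q": "history of jazz",
    "query_by": "title,description",
    "sort_by": "_text_match:desc,popularity:desc",
}
results = requests.get(
    f"{TYPESENSE}/collections/episodes_sample/documents/search",
    params=params,
    headers=HEADERS,
).json()
for hit in results.get("hits", []):
    print(hit["document"]["title"])
```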
s
@Alexander Zierhut Is this data available publicly, so we can work together on solving this problem?
a
@satish venkatakrishnan I'll send you a PM.