#community-help

Public Datasets for Typesense Testing

TLDR Joshua offers publicly accessible JSON files full of podcast data for Typesense testing. Jason and Janaka express interest, but they discuss potential licensing issues. They conclude the data is likely in the public domain.

Oct 07, 2021 (25 months ago)
Joshua
03:59 PM
Is there any interest in fairly large publicly available datasets ready to import to Typesense for testing?

I'm working on a search index for podcasts and am building a collection of ready to use JSON files that will be loaded with LOTS of data that I can make available for download if there's interest in that.
Jason
04:46 PM
Oooh, yes I would definitely appreciate that Joshua! Would love to increase my repertoire of datasets to do tests with
Joshua
05:26 PM
Cool. Working on this currently. Hope to have something pushed out next week 🤞
Oct 09, 2021 (25 months ago)
Janaka
03:53 PM
This is an interesting idea from the perspective of my universal personal search engine.
Oct 11, 2021 (25 months ago)
Joshua
02:51 PM
Alright, I created a bunch of JSON files based on podcasts updated in the past 3 months according to Podcast Index. I've tested importing (a subset of) these into Typesense. The files and info about the setup can be found here. Let me know if you have questions. 😃
https://files.srch.cc/file/srch-files/files.html
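For anyone who wants to try an import like the one described above, here is a minimal sketch. It assumes the files are JSONL (one JSON document per line). The file name `podcasts.jsonl`, the collection name `podcasts`, the connection settings, and the batch size are all hypothetical. The batching helper is plain Python; the Typesense call is shown only as a comment because it needs the `typesense` client package and a running server.

```python
import json
from typing import Iterable, Iterator, List


def batched_jsonl(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[dict]]:
    """Parse JSONL lines and yield lists of at most batch_size documents."""
    batch: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines between records
            continue
        batch.append(json.loads(line))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


# Usage sketch (not executed here; host, port, and API key are placeholders):
#
#   import typesense
#   client = typesense.Client({
#       "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
#       "api_key": "xyz",
#       "connection_timeout_seconds": 10,
#   })
#   with open("podcasts.jsonl", encoding="utf-8") as fh:
#       for docs in batched_jsonl(fh):
#           client.collections["podcasts"].documents.import_(docs, {"action": "create"})
```

Importing in batches rather than all at once keeps memory use bounded, which matters for files as large as the ones described here.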
Jason
08:26 PM
Thanks Joshua! Just to be sure, may I ask what license you're releasing these under? 🙂
Jason
08:29 PM
On a side note, it would help improve relevancy and performance if you're able to break out the podcast transcripts into say paragraphs, or by speakers, and index each as a separate record
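The suggestion above could be sketched roughly as follows. This assumes paragraphs in a transcript are separated by blank lines; the field names (`episode_id`, `title`, `paragraph_index`, `text`) are hypothetical and not taken from the published files.

```python
import re
from typing import Dict, List


def transcript_to_records(episode_id: str, title: str, transcript: str) -> List[Dict[str, object]]:
    """Split a transcript on blank lines and build one record per paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", transcript) if p.strip()]
    return [
        {
            "id": f"{episode_id}-{i}",   # unique per paragraph
            "episode_id": episode_id,    # lets results be grouped back by episode
            "title": title,
            "paragraph_index": i,
            "text": paragraph,
        }
        for i, paragraph in enumerate(paragraphs)
    ]
```

Keeping `episode_id` on every record means paragraph-level hits can still be grouped or filtered by episode at query time.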
Joshua
08:41 PM
Jason good/fair question on the license...i'm not sure...i mean, the content is not mine so i'm not sure how one goes about licensing something like this
Joshua
08:42 PM
yeah, most podcasts don't do transcripts, but the ones that do have a TON of content in there so breaking that out would be much better...this was more about getting the initial set of files in a place where they could be imported - lots to improve on from there 😺
Jason
08:58 PM
It looks like podcastindex is an open database, though they don't specify a license explicitly. They say "The core, categorized index will always be available for free, for any use", which I guess puts it in the public domain
Jason
08:59 PM
In any case, thank you for making these JSONL files available!

Nov 02, 2021 (25 months ago)
Janaka
10:47 PM
Re: license, I'm in no way an expert, but if the content is available on the web and it's just a search index, that's no different from Google indexing, right? So licensing shouldn't be an issue?
Nov 03, 2021 (24 months ago)
Jason
01:41 AM
Good point and that sounds reasonable to me... but IANAL!
Joshua
04:09 PM
Janaka, that is my thought too...these are all publicly available RSS feeds for podcasts