# community-help
j
Is there any interest in fairly large, publicly available datasets that are ready to import into Typesense for testing? I'm working on a search index for podcasts and am building a collection of ready-to-use JSON files, loaded with LOTS of data, that I can make available for download if there's interest.
j
Oooh, yes I would definitely appreciate that @Joshua Hoover! Would love to increase my repertoire of datasets to do tests with
j
Cool. Working on this currently. Hope to have something pushed out next week 🤞
j
This is an interesting idea from the perspective of my universal personal search engine.
j
Alright, I created a bunch of JSON files based on podcasts updated in the past 3 months according to Podcast Index. I've tested importing (a subset of) these into Typesense. The files and info about the setup can be found here. Let me know if you have questions. 😃 https://files.srch.cc/file/srch-files/files.html
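For anyone who wants to try an import, here is a rough sketch of what the request could look like against Typesense's bulk import endpoint. The server URL, collection name, API key, and file name are all placeholders (not taken from the files above), and the function only builds the request so you can inspect it before sending anything.

```python
import urllib.request


def build_import_request(base_url, collection, api_key, jsonl_path):
    """Build (but do not send) a bulk-import request for one JSONL file.

    All arguments are placeholders: point them at your own server,
    collection, and downloaded file.
    """
    # Reads the whole file into memory; very large files may need to be
    # split into smaller batches first.
    with open(jsonl_path, "rb") as f:
        body = f.read()
    url = f"{base_url}/collections/{collection}/documents/import?action=create"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-TYPESENSE-API-KEY": api_key, "Content-Type": "text/plain"},
    )


# Usage (hypothetical values), sending is left to the caller:
# req = build_import_request("http://localhost:8108", "podcasts", "xyz", "podcasts.jsonl")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

The collection has to exist with a matching schema before the import will succeed.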
j
Thanks @Joshua Hoover! Just to be sure, may I ask what license you're releasing these under? 🙂
On a side note, it would help improve relevancy and performance if you're able to break the podcast transcripts out into, say, paragraphs or speaker turns, and index each one as a separate record.
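Something along these lines is what I have in mind. It's only a sketch: the field names (`episode_id`, `title`, `paragraph_index`, `text`) are made up for illustration and not taken from the actual files, and it assumes paragraphs are separated by blank lines.

```python
import json


def split_transcript(episode_id, title, transcript):
    """Yield one record per non-empty paragraph of a transcript.

    Field names are hypothetical; adapt them to the real schema.
    """
    # Assumes paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in transcript.split("\n\n") if p.strip()]
    for i, text in enumerate(paragraphs):
        yield {
            "id": f"{episode_id}-{i}",  # unique per paragraph
            "episode_id": episode_id,   # lets results be grouped by episode
            "title": title,
            "paragraph_index": i,
            "text": text,
        }


def to_jsonl(records):
    """Serialize records as JSONL, one JSON document per line."""
    return "\n".join(json.dumps(r) for r in records)
```

Keeping `episode_id` on every record means hits from the same episode can still be grouped together at search time.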
j
@Jason Bosco good/fair question on the license... I'm not sure. I mean, the content is not mine, so I'm not sure how one goes about licensing something like this.
Yeah, most podcasts don't do transcripts, but the ones that do have a TON of content in there, so breaking that out would be much better. This was more about getting the initial set of files to a place where they could be imported. Lots to improve on from there 😺
j
It looks like Podcast Index is an open database, though they don't specify a license explicitly. They say "The core, categorized index will always be available for free, for any use", which I guess puts it in the public domain.
In any case, thank you for making these JSONL files available!
j
Re: license, I'm in no way an expert, but if the content is available on the web and it's just a search index, that's no different from Google indexing it, right? So licensing shouldn't be an issue?
j
Good point, and that sounds reasonable to me... but IANAL!
j
janaka, that is my thought too... these are all publicly available RSS feeds for podcasts.