We are a cybersecurity company that provides data ...
# community-help
c
We are a cybersecurity company that provides data breach intelligence services for organizations. We are exploring new options to search our datasets. We have 100+ terabytes of files in various formats, including txt, csv, json, pdf, sql, rar, zip, html, etc. We are looking for a service where we can either upload all our data or point to our cloud storage, and have the data indexed and searchable by the specific categories we look for, including: names, company names, emails, domain names, phone numbers, usernames, passwords, crypto addresses, credit card numbers, etc. We store the data indefinitely. We need to search the data via APIs provided by the service. Is this a good use case for Typesense?
j
Typesense is an in-memory search engine, so putting 100+ TB of data in RAM is going to be cost-prohibitive. I would recommend using Elasticsearch for your use case.
👍 1
c
Thank you for the quick reply. Do you have any other suggestions besides Elasticsearch/Opensearch?
j
I wish! Unfortunately, for that scale of data Elasticsearch is the most battle-tested
And also cost-effective, because it uses disk-based indices
❤️ 1
There's also Solr, but I've typically only heard of it for site/app search. Then again, Solr just uses Lucene under the hood, like Elasticsearch
c
Have you heard anything about Quickwit or Seekstorm?
j
I have heard of Quickwit, but haven't used it myself, so I can't give you an opinion. I haven't heard of Seekstorm.
👍 1
I believe Datadog acquired Quickwit recently
s
Just an idea… you mention zips, which as such aren’t searchable anyway. Probably the biggest chunks are rar and zip? For those you’d just want to find the archive name. I suspect many parts of your data are similar, so you could maybe bring down your data size massively by indexing only certain "keys" of the data. Plus, you could (and probably should, for vector-based search) summarize your readable contents, such as txt or pdf files. In the end you could shrink the indexed dataset by a lot. Just a thought.
❤️ 1
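To make the "index only certain keys" idea concrete, here is a minimal sketch in Python. It assumes the keys of interest are email addresses and their domains, and that each file can be read as text. The function name `extract_keys`, the record fields, and the regex are all hypothetical, and the pattern is deliberately loose, so treat it as an illustration rather than a production extractor.

```python
import re

# Loose, illustrative email pattern (an assumption, not a full RFC 5322 matcher).
# Group 1 captures the domain part after the "@".
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def extract_keys(source_name, text):
    """Return one compact record per distinct email address found in `text`.

    Only these small records would be sent to the search index; the raw
    file stays in cloud storage and is referenced by `source`.
    """
    records = []
    seen = set()
    for match in EMAIL_RE.finditer(text):
        email = match.group(0).lower()
        if email in seen:
            continue  # skip duplicates within the same file
        seen.add(email)
        records.append(
            {
                "email": email,
                "domain": match.group(1).lower(),
                "source": source_name,
            }
        )
    return records


sample = "alice@example.com:hunter2\nbob@corp.example.org;pw\nALICE@example.com:x"
records = extract_keys("dump.txt", sample)
```

The same approach would extend to other categories (phone numbers, crypto addresses, and so on) with one pattern per category, and the resulting records are what gets indexed instead of the full 100+ TB.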