Hello :) I was wondering does typesense have a way...
# community-help
s
Hello :) I was wondering, does Typesense have a way to ignore certain strings in a document? We store HTML data and would like to ignore the HTML tags, as well as any HTML attributes, when performing a search. E.g., if I search for ‘mark’, I shouldn’t get hits for ‘<mark>...’
k
Is there a reason you can't remove those HTML tags before indexing the text?
s
So we store HTML in our DB and then render that HTML directly in the frontend. It’s not pretty, but it’s extremely hard to change right now. Because of this, I’m afraid it would be difficult to use Typesense's highlight tags feature to show the user where the query matches if the hits came back without HTML. Basically, it would be hard to render from the hits if the HTML didn’t exist.
Maybe if there’s a way to store HTML in a non searchable field and then store the text in a searchable field that would solve this? But it would double the size of each document.
So my question is: is there a way to customize the indexing to strip the HTML tags before the doc is indexed?
r
"Maybe if there’s a way to store HTML in a non searchable field and then store the text in a searchable field" https://typesense.org/docs/0.20.0/api/collections.html#with-pre-defined-schema "Your documents can contain other fields not mentioned in the collection's schema - they will be stored on disk but not indexed in memory." That said, your `query_by` will define what gets searched on.
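A minimal sketch of the two-field approach described above, using only the Python standard library. The collection name `articles` and the field names `content_html` and `content_text` are hypothetical, and a real schema may need more than what is shown here. The idea is that only `content_text` is declared in the schema, so the raw HTML is stored with the document but never searched.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes only; tags and attributes are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags (note: this also includes the
        # contents of <script> and <style> elements, if any are present).
        self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical schema: only content_text is declared, so only it is indexed.
schema = {
    "name": "articles",
    "fields": [{"name": "content_text", "type": "string"}],
}


def build_document(doc_id: str, html: str) -> dict:
    """Build a document holding both the raw HTML and its plain text."""
    return {
        "id": doc_id,
        "content_html": html,              # not in the schema: stored, not indexed
        "content_text": strip_html(html),  # in the schema: searchable
    }


doc = build_document("1", '<p class="mark">Hello <mark>world</mark></p>')

# Search parameters restricted to the plain-text field via query_by.
search_params = {"q": "mark", "query_by": "content_text"}
```

With this document, a search for "mark" has nothing to match in `content_text`, while `content_html` is still returned with the hit for rendering. The trade-off raised earlier still applies: each document carries the content twice, and highlight snippets would refer to the plain text rather than the HTML. Text from adjacent elements is joined without a separator here, so block-level markup may need extra handling.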
k
@Shouvik D'Costa Would it help if we added a flag to skip html tags at indexing, but still return the tags in the response?
s
@Kishore Nallan yes, that would be the perfect solution for us! I believe that in this case the highlight tags would also be at the correct offset positions in the original HTML. Is this flag difficult to implement?
@Ricardo thanks yes that’s exactly what I meant to refer to
k
Cool, that should be possible once we figure out what kind of library we need for the HTML parsing. Can you please create an issue on GitHub so that we can track it and so that others can also discover it and chime in?
s
Will do and I’ll post it here when created