#community-help

Ignoring HTML Tags in Typesense Document Search

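TLDR: Shouvik asked how to keep HTML tags from matching in Typesense searches. Kishore Nallan suggested stripping the tags before indexing, and Ricardo pointed out that fields left out of the schema are stored but not indexed, so the HTML can live in a non-searchable field. Kishore Nallan then proposed adding a flag to skip HTML tags at indexing time; Shouvik agreed and said he would open a GitHub issue to track it.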

May 01, 2021 (33 months ago)
Shouvik
02:05 PM
Hello :) I was wondering, does Typesense have a way to ignore strings in a document? We store HTML data and would like to ignore the HTML tags, as well as any HTML attributes, when performing a search. E.g., if I search for ‘mark’, I shouldn’t get hits for ‘<mark>...’
Kishore Nallan
02:06 PM
Is there a reason you can't remove those HTML tags before indexing the text?
Shouvik
02:12 PM
So we store HTML in our db and then render that HTML directly in the frontend. It’s not pretty but it’s extremely hard to change right now. Because of this I’m afraid it would be difficult to use the highlight tags feature you have with typesense to show the user where the query matches if we were to return non HTML hits. Basically it would be hard to render from the hits if the HTML didn’t exist
Shouvik
02:13 PM
Maybe if there’s a way to store HTML in a non searchable field and then store the text in a searchable field, that would solve this? But it would double each document's size.
Shouvik
05:45 PM
So my question is: is there a way to customize the indexing to strip the HTML tags before the doc is indexed?
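
For readers following along, stripping tags on the application side before indexing could look roughly like this. It is a minimal sketch using Python's standard-library html.parser; the name strip_html is made up for illustration and is not a Typesense feature. It keeps every text node, so any script or style content would also end up in the output.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags and attributes."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for text between tags; tag names and attributes never arrive here.
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent block elements do not fuse,
        # then collapse runs of whitespace. The trade-off: inline markup
        # inside a word ("foo<b>bar</b>") is split into two words.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html: str) -> str:
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()


if __name__ == "__main__":
    print(strip_html('<p>Read the <mark class="hit">docs</mark> first.</p>'))
```

The plain text returned here is what would go into the searchable field, so a query for ‘mark’ no longer matches the tag name.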
May 02, 2021 (33 months ago)
Ricardo
06:16 AM
"Maybe if there’s a way to store HTML in a non searchable field and then store the text in a searchable field"

https://typesense.org/docs/0.20.0/api/collections.html#with-pre-defined-schema
"Your documents can contain other fields not mentioned in the collection's schema - they will be stored on disk but not indexed in memory."

That said, your query_by will define what gets searched on.
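
As a rough illustration of the approach described above, the payloads might be shaped as follows. The collection name articles and the field names title, body_html, and body_text are hypothetical, and other schema options are omitted. Only title and body_text are declared in the schema, so per the docs quoted above the raw HTML would be stored on disk but not indexed in memory. No client calls are shown; this only builds the dictionaries.

```python
# Hypothetical collection schema: body_html is deliberately NOT listed,
# so it would be stored with the document but not indexed.
schema = {
    "name": "articles",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "body_text", "type": "string"},
    ],
}


def build_document(doc_id: str, title: str, body_html: str, body_text: str) -> dict:
    """Build a document carrying both the raw HTML and its plain-text copy."""
    return {
        "id": doc_id,
        "title": title,
        "body_html": body_html,  # rendered by the frontend, not searched
        "body_text": body_text,  # tag-free copy that searches run against
    }


# query_by names only indexed fields, so tag names such as "mark"
# in body_html cannot produce hits.
search_parameters = {"q": "mark", "query_by": "title,body_text"}

if __name__ == "__main__":
    doc = build_document(
        "1",
        "Getting started",
        "<p>Read the <mark>docs</mark> first.</p>",
        "Read the docs first.",
    )
    print(doc)
```

This still stores the content twice, which is the size concern raised earlier in the thread; only the plain-text copy takes up index memory.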
Kishore Nallan
09:53 AM
Shouvik Would it help if we added a flag to skip html tags at indexing, but still return the tags in the response?
Shouvik
01:29 PM
Kishore Nallan yes, that would be the perfect solution for us! I believe in that case the highlight tags would also be at the correct offset positions in the original HTML. Is this flag difficult to implement?
Shouvik
01:30 PM
Ricardo thanks yes that’s exactly what I meant to refer to
Kishore Nallan
01:47 PM
Cool, that should be possible once we figure out what kind of library we need for the HTML parsing. Can you please create an issue on GitHub so that we can track it and so that others can also discover it and chime in?
Shouvik
01:49 PM
Will do and I’ll post it here when created