#community-help

Using Highlights in typesense-go

TLDR Oliver was concerned that highlighted results in typesense-go mix trusted <mark> tags with unescaped, untrusted document text. Jason advised either sanitizing HTML before ingesting data or using arbitrary strings as highlight markers, HTML-escaping the result, and then swapping in the HTML tags.

Sep 28, 2023
Oliver
05:22 PM
Hello! I'm having a hard time understanding the "correct" way to use highlights. For full context we're using typesense-go but I am not sure how much that matters.

I note that in typesense-go, the two search parameters HighlightStartTag and HighlightEndTag default to <mark> and </mark> respectively. This suggests to me that the expectation is that you should be able to use HTML tags as the highlight markers. However, I also note that the Typesense results are not HTML-escaped. So if we have a document that says something like "did you know that <script>alert(1)</script> is an xss payload" and I search for the word "know", I end up with "did you <mark>know</mark> that <script>alert(1)</script> is an xss payload", and I don't know what to do with this. If I render it, the script fires, which is obviously not what I want. If I escape it, the highlight doesn't work.

I could run each of the results through an HTML sanitizer, but before going down that route I just want to do a quick sanity check that I'm not doing something silly, because I have a hard time believing that the expected use case involves receiving a string that mixes both trusted and untrusted content.
Jason
05:38 PM
Oliver
05:41 PM
thank you!
Oliver
05:44 PM
though I don't think that's quite our problem -- it's not that we're searching HTML content, it's that we're searching documents whose text may contain sequences of characters which are also valid HTML
Oliver
05:44 PM
that is, I would want the result to look like this:

did you know that <script>alert(1)</script> is an xss payload

not like this:

did you know that is an xss payload
Oliver
05:47 PM
We can engineer our way around it though, mainly by using some unlikely-to-appear string as HighlightStartTag and HighlightEndTag, HTML-escaping the result, then swapping in the highlight HTML tags for the unlikely-to-appear strings that we used. I just assumed there would be a simpler way, since typesense-go defaults to using HTML tags directly as these values, and I was having a hard time reconciling that with needing to do this
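
The workaround described above can be sketched in Go using only the standard library. This is a minimal sketch, not the typesense-go API: the marker constants and the renderSnippet helper are hypothetical names, and it assumes the same marker strings were passed as HighlightStartTag and HighlightEndTag in the search request and never occur in the indexed text.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// Hypothetical placeholder markers (an assumption for this sketch).
// They must match the values sent as HighlightStartTag and
// HighlightEndTag, and should be unlikely to appear in real documents.
const (
	startMarker = "__HL_START_7f3a__"
	endMarker   = "__HL_END_7f3a__"
)

// renderSnippet HTML-escapes the entire snippet first, so any markup in
// the document text becomes inert, and only then swaps the placeholder
// markers for real <mark> tags.
func renderSnippet(snippet string) string {
	escaped := html.EscapeString(snippet)
	escaped = strings.ReplaceAll(escaped, startMarker, "<mark>")
	return strings.ReplaceAll(escaped, endMarker, "</mark>")
}

func main() {
	// Simulated highlight snippet, as it might come back from a search
	// for "know" with the placeholder markers configured.
	snippet := "did you " + startMarker + "know" + endMarker +
		" that <script>alert(1)</script> is an xss payload"
	fmt.Println(renderSnippet(snippet))
	// Prints: did you <mark>know</mark> that &lt;script&gt;alert(1)&lt;/script&gt; is an xss payload
}
```

The order matters: escaping must happen before the swap, otherwise the inserted <mark> tags would be escaped too. The markers here contain only letters, digits, and underscores, so html.EscapeString leaves them untouched.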
Jason
05:48 PM
You would have to do HTML sanitization before ingesting the data into Typesense, if you don't expect HTML data to normally occur in the dataset
Jason
05:48 PM
> mainly by using some unlikely-to-appear string as HighlightStartTag and HighlightEndTag, HTML-escaping the result, then swapping in the highlight HTML tags
Yeah, that's the other thing I was going to suggest
Jason
05:48 PM
Basically, HTML content is no different to Typesense than any other text data
Jason
05:49 PM
Except for the <mark> tags used for highlighting by default, but that can be changed to any arbitrary string, not HTML tags necessarily
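
Since the markers can be any arbitrary string, one way to make them hard to collide with document text is to generate a fresh pair per request. The following is a hedged sketch under that assumption; newMarkers is a hypothetical helper, not part of typesense-go, and the marker format is an arbitrary choice.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newMarkers returns a start/end marker pair that share a random nonce.
// The markers use only lowercase letters, digits, and underscores, so
// they survive HTML escaping unchanged.
func newMarkers() (start, end string, err error) {
	buf := make([]byte, 8)
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	nonce := hex.EncodeToString(buf)
	return "__hl_start_" + nonce + "__", "__hl_end_" + nonce + "__", nil
}

func main() {
	start, end, err := newMarkers()
	if err != nil {
		panic(err)
	}
	// These values would be passed as HighlightStartTag and
	// HighlightEndTag, and remembered for the swap after escaping.
	fmt.Println(start, end)
}
```

A fixed marker is simpler and may be enough in practice; a per-request nonce mainly helps if users who know the fixed marker could put it into indexed documents.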
Oliver
05:51 PM
got it, thank you for your help 🙂

Oliver
05:51 PM
It seems to me to be a strange design choice to default to using an HTML tag as the highlight boundaries, when they cannot be used safely in this manner. But regardless, we can work around it with minimal difficulty
Jason
05:54 PM
In the majority of use cases for Typesense, documents are text-based without any HTML tags and are displayed on a web page. That's the reason we've set HTML tags as the default markers: to make this common use case work without any additional configuration
Oliver
06:05 PM
makes sense, thanks again
