#community-help

Resolving HTML Content Search Issues

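TLDR Ramy found that words sitting directly next to HTML tags in indexed content were not matched by search. Jason initially suggested adding `<`, `>` and `/` to the `token_separators` config but later recommended storing a plain-text version of the HTML content instead. Ramy appreciated the advice. Ed later observed that the issue seems to appear only when a tag sits between two words of the query.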

Solved
Sep 19, 2023 (2 months ago)
Ramy
12:57 AM
Hello 👋
I am seeing a weird behavior (I am sure it can be fixed via some config)
We have some HTML content saved and indexed, but if we search for a word that sits directly inside tags with no space in between, it is not matched (although it is matched if we include the `>` or the full tag)
Jason
12:59 AM
You want to add < and > to the token_separators config when creating the collection

Ramy
12:59 AM
I love how the founders are very responsive ❤️

Ramy
01:00 AM
Should I include the `/` from `</p>` as well?
Jason
01:14 AM
Oh yeah, / as well
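For reference, a minimal sketch of what a collection schema with these separators might look like, written as a plain Python dict. The collection name `articles` and the field name `content` are hypothetical; `token_separators` is the schema option Jason refers to, and it is set when the collection is created.

```python
# Hypothetical collection schema; only `token_separators` comes from the thread.
schema = {
    "name": "articles",  # assumed collection name
    "fields": [
        {"name": "content", "type": "string"},  # assumed field holding the HTML
    ],
    # Characters listed here act as token boundaries at indexing time, so a
    # value like "<p>engineer</p>" should be split around the tag characters.
    "token_separators": ["<", ">", "/"],
}

# With the official Python client this would be sent roughly as:
#   client.collections.create(schema)
# That call needs a running Typesense server, so it is left as a comment.
```

Note that with this approach the tag names themselves (`p`, `b`, and so on) would still end up in the index as tokens.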
Ramy
01:15 AM
Is separating on these tokens better, or is it better to store a plain-text version of the HTML content and search in that instead?
Jason
01:52 AM
I would recommend storing a plain-text version of the HTML content: https://typesense.org/docs/guide/tips-for-searching-common-types-of-data.html#html-content

Jason
01:54 AM
I originally (mis)interpreted your question as wanting to search the HTML tags specifically... But re-reading it, I now see what you meant. So following the approach in the link above would be better.
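A rough sketch of the plain-text approach, assuming the HTML is stripped on the application side before indexing. It uses only Python's standard-library `html.parser`; the helper name `html_to_plain_text` and the field names `content_html` and `content_text` are made up for illustration and are not taken from the linked guide.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space so words from adjacent tags do not run
        # together, then collapse any repeated whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_plain_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical document: keep the HTML for display, search the plain text.
doc = {"id": "1", "content_html": "<p>Senior <b>network</b> engineer</p>"}
doc["content_text"] = html_to_plain_text(doc["content_html"])
```

This simple version also keeps the text inside `<script>` and `<style>` elements, and it inserts a space wherever a tag splits a single word, so real content may need a more careful extractor.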
Ramy
01:54 AM
great, thank you very much

Ed
09:04 AM
but I think this is working fine even in html tags
Ed
09:11 AM
Ah, it appears only if you’re searching for "network engineer" and the HTML tag is in between the two words: "network <b> engineer"
