DocSearch Scraper Metadata Configuration and Filter Problem
TLDR Marcos had trouble getting the DocSearch scraper to add metadata attributes and filtering out documents without content. Jason fixed the issue by updating the scraper and provided filtering instructions.
Mar 28, 2023 (6 months ago)
Marcos
11:15 PM docsearch
in the schema. How should I configure the docsearch.config.js to make it work?
Jason
11:23 PM
Marcos
11:28 PM
Marcos
11:29 PM
Jason
11:30 PM
Marcos
11:30 PM
Marcos
11:31 PM
Jason
11:31 PM
Marcos
11:32 PM
{
  "index_name": "croct-docs",
  "start_urls": [
    " "
  ],
  "sitemap_urls": [
    ""
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": "//aside/nav//*[contains(@class, \"selected\")]/preceding::li[contains(@class, \"title\")][1]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5",
    "lvl6": "article h6",
    "text": "article p, article li"
  },
  "strip_chars": " .,;:#",
  "js_render": false
}
Jason
11:36 PM
Jason
11:36 PM
Marcos
11:36 PM
Jason
11:36 PM
Marcos
11:36 PM
{
  "message": "Could not find a field named `title_tag` in the schema."
}
Jason
11:37 PM
Jason
11:37 PM
Marcos
11:38 PM
Jason
11:39 PM
Marcos
11:40 PM
Mar 29, 2023 (6 months ago)
Marcos
12:11 AM content
in some cases. How can I filter them out from my query results?
Marcos
12:17 AM type
that is content for paragraphs. However, I can't filter by it. The result is always empty. Any idea why?
Marcos
12:17 AM filter_by=type:content
Jason
12:19 AM
Jason
12:19 AM "text": "article p, article li"
Marcos
12:20 AM
Marcos
12:20 AM type=lvl1
Jason
12:21 AM
Marcos
12:21 AM type=lvl1) and another with the paragraph below it and with content (type=content)
Marcos
12:21 AM
Jason
12:24 AM
Marcos
12:25 AM
Marcos
12:25 AM
Marcos
12:26 AM filter_by
to remove the documents without content or to filter by the type?
Jason
12:36 AM type
https://typesense-community.slack.com/archives/C01P749MET0/p1680049072171189?thread_ts=1680045323.244179&cid=C01P749MET0
Marcos
12:36 AM
Marcos
12:36 AM
Jason
12:37 AM
Marcos
12:38 AM
Jason
12:38 AM
Marcos
12:39 AM documents/search?q=to&query_by=title_tag,content&group_by=route_tag&include_fields=type,content,title_tag,anchor,hierarchy&highlight_fields=content&filter_by=language:en-us,type:content
Marcos
12:39 AM type:content the result is empty
Marcos
12:40 AM
Jason
12:40 AM
Marcos
12:42 AM
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "http://.../collections/.../documents/search?q=to&query_by=title_tag,content&group_by=route_tag&include_fields=type,content,title_tag,anchor,hierarchy&highlight_fields=content&filter_by=language:en-us,type:content"
Marcos
12:44 AM
Jason
12:46 AM
Jason
12:47 AM
Jason
12:47 AM type as a field to the schema in the scraper and publish a new version shortly
Marcos
12:47 AM
Marcos
12:48 AM type as a field to the schema in the scraper and publish a new version shortly
It'll help a lot!
Jason
12:57 AM typesense/docsearch-scraper:0.4.1 docker image and re-run the scraper?
Marcos
01:02 AM
Marcos
01:02 AM
Marcos
01:02 AM
Marcos
01:07 AM
Jason
01:08 AM
Jason
01:08 AM filter_by=language:en-us && type:content
Jason
01:08 AM && instead of , between multiple conditions
Marcos
01:12 AM
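For reference, a minimal Python sketch of the corrected query from the messages above: the filter_by conditions are joined with && rather than a comma, and the parameters are URL-encoded so the spaces and ampersands in the filter are not misread as parameter separators. The helper names join_filters and build_search_query are hypothetical, the parameter values are copied from the request earlier in the thread, and the host and collection stay elided as in the original.

```python
from urllib.parse import urlencode, parse_qs


def join_filters(*conditions: str) -> str:
    """Join filter conditions with ' && ' (not ','), as advised in the thread."""
    return " && ".join(c.strip() for c in conditions if c.strip())


def build_search_query(params: dict) -> str:
    """URL-encode the search parameters into a query string."""
    return urlencode(params)


# Parameter values taken from the request shown earlier in the thread.
params = {
    "q": "to",
    "query_by": "title_tag,content",
    "group_by": "route_tag",
    "include_fields": "type,content,title_tag,anchor,hierarchy",
    "highlight_fields": "content",
    "filter_by": join_filters("language:en-us", "type:content"),
}

query_string = build_search_query(params)

# Round-trip check: decoding gives back the readable filter expression.
decoded_filter = parse_qs(query_string)["filter_by"][0]
print(decoded_filter)
```

The resulting query_string would be appended to the .../documents/search? endpoint; encoding matters here because a raw && in a URL would otherwise be split into separate, empty parameters.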
Marcos
01:12 AM
Marcos
01:12 AM
Jason
01:12 AM group_limit is the search param you can use to control that
Marcos
01:13 AM
Marcos
01:13 AM group_limit = max number of entries per group
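To illustrate the definition above, here is a small client-side Python sketch of what grouping with a per-group limit means: hits are bucketed by a field and at most group_limit entries are kept in each bucket. It only mimics the semantics on made-up sample data and is not how Typesense implements grouping; the function name group_hits and the sample hits are hypothetical, while the field names follow the query in the thread.

```python
from collections import defaultdict


def group_hits(hits, group_by, group_limit):
    """Bucket hits by `group_by`, keeping at most `group_limit` per bucket."""
    groups = defaultdict(list)
    for hit in hits:
        bucket = groups[hit[group_by]]
        if len(bucket) < group_limit:
            bucket.append(hit)
    return dict(groups)


# Hypothetical sample hits, already in ranked order.
sample_hits = [
    {"route_tag": "/docs/a", "type": "content", "content": "first"},
    {"route_tag": "/docs/a", "type": "content", "content": "second"},
    {"route_tag": "/docs/b", "type": "content", "content": "third"},
]

grouped = group_hits(sample_hits, group_by="route_tag", group_limit=1)
for route, entries in grouped.items():
    print(route, [entry["content"] for entry in entries])
```

In an actual search request the same effect would come from passing group_by=route_tag together with group_limit=1 as query parameters.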
Jason
01:13 AM
Jason
01:14 AM
Jason
01:14 AM
Marcos
01:21 AM
Marcos
01:21 AM
Marcos
01:21 AM
Jason
01:22 AM
Marcos
01:23 AM
Jason
01:23 AM
Jason
01:24 AM sort_by=_text_match:desc,item_priority:desc as an additional query parameter
Marcos
01:27 AM
Marcos
01:28 AM
Marcos
01:29 AM
Jason
01:42 AM You’re right. I was thinking of something else. Scratch what I said earlier. You shouldn’t have to specify it explicitly, since that’s the default behavior.
Marcos
02:13 AM
Jason
02:45 AM prioritize_token_position=true
for that