#community-help

DocSearch Scraper Metadata Configuration and Filter Problem

TLDR: Marcos ran into issues with the DocSearch scraper not adding metadata attributes to the collection schema, and then needed to filter out documents without content. Jason fixed the issue by updating the scraper and explained the correct filter syntax.


Solved
Mar 28, 2023 (8 months ago)
Marcos
11:15 PM
Hi guys! I'm using the DocSearch scraper, but it doesn't include the attributes from the docsearch metadata in the schema. How should I configure docsearch.config.js to make it work?
Jason
11:23 PM
Marcos
11:28 PM
I've just tested it and it didn't work
11:29
Marcos
11:29 PM
nothing added to the schema
Jason
11:30 PM
Is your docs site public so I can take a look, and are you running the scraper against this public site?
11:31
Marcos
11:31 PM
[screenshot]
Jason
11:31 PM
Could you share the scraper config you’re using?
Marcos
11:32 PM
{
  "index_name": "croct-docs",
  "start_urls": [
    ""
  ],
  "sitemap_urls": [
    ""
  ],
  "sitemap_alternate_links": true,
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": "//aside/nav//*[contains(@class, \"selected\")]/preceding::li[contains(@class, \"title\")][1]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5",
    "lvl6": "article h6",
    "text": "article p, article li"
  },
  "strip_chars": " .,;:#",
  "js_render": false
}

Jason
11:36 PM
Hmm, I just ran the scraper with that config, and I see those attributes being added to the documents in Typesense
[screenshot]
11:36
Jason
11:36 PM
Are you on Typesense Cloud?
Marcos
11:36 PM
Yes
Jason
11:36 PM
Could you DM me your cluster ID?
Marcos
11:36 PM
it's added indeed to the document, but not to the schema, so when I run a query I get:

{
  "message": "Could not find a field named `title_tag` in the schema."
}
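(Note: the fields registered in a collection's schema can be checked with a GET on the collection endpoint. A minimal sketch, assuming the croct-docs collection from the config above; host and API key are placeholder environment variables:)

# List the field names registered in the collection schema (requires jq)
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "https://${TYPESENSE_HOST}/collections/croct-docs" | jq '.fields[].name'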
Jason
11:37 PM
Hmm, I see it in the schema as well… when I tried it
11:37
Jason
11:37 PM
[screenshot]
Marcos
11:38 PM
I've just tried to delete the collection to recreate it from scratch and it worked
Jason
11:39 PM
Ah ok, my guess is that the scraper hadn’t completed re-indexing fully once the meta tags were added, so you might have seen the error if you tried to search just before that
Marcos
11:40 PM
This scraper and GitHub Action are pretty useful


Mar 29, 2023 (8 months ago)
Marcos
12:11 AM
Jason, the scraper produces documents without content in some cases. How can I filter them out from my query results?
12:17
Marcos
12:17 AM
Apparently there is a field called type that is content for paragraphs. However, I can't filter by it; the result is always empty. Any idea why?
12:17
Marcos
12:17 AM
I'm just doing filter_by=type:content
Jason
12:19 AM
This happens when sections on the page don’t have any content below a heading (e.g. just an image)
12:19
Jason
12:19 AM
Or maybe this selector is not picking up the content: "text": "article p, article li"
Marcos
12:20 AM
This is not the case. The scraper is creating an entry that has no content, just the title
12:20
Marcos
12:20 AM
In this case, type=lvl1
Jason
12:21 AM
Hmm, I haven’t dug into the exact structure of the documents generated by the scraper, since the DocSearch UI takes care of using them appropriately. Let me check how the Typesense docs look
Marcos
12:21 AM
there are two entries: one with just the title and no content (type=lvl1) and another with the paragraph below it and with content (type=content)
12:21
Marcos
12:21 AM
I want to filter these lvl* entries out since they don't help
Jason
12:24 AM
I see the same in Typesense docs as well. But the docsearch-ui handles it
[screenshots]
Marcos
12:25 AM
the problem is that it will affect the limit I'm passing
12:25
Marcos
12:25 AM
I want at most 15 records. Supposing 5 are empty, the result will be 10, even though I have 15+ matches
12:26
Marcos
12:26 AM
is there any way to use the filter_by to remove the documents without content or to filter by the type?
Jason
12:36 AM
There is no way to remove documents with null content. This would be the way to filter by type https://typesense-community.slack.com/archives/C01P749MET0/p1680049072171189?thread_ts=1680045323.244179&cid=C01P749MET0
Marcos
12:36 AM
it doesn't work =/
12:36
Marcos
12:36 AM
could you try to confirm if it's a bug?
Jason
12:37 AM
What error do you see when you try that?
Marcos
12:38 AM
no error, just no results at all
Jason
12:38 AM
Could you look for a network request to Typesense in the browser’s dev console, right-click, copy-as-curl and share that here?
Marcos
12:39 AM
documents/search?q=to&query_by=title_tag,content&group_by=route_tag&include_fields=type,content,title_tag,anchor,hierarchy&highlight_fields=content&filter_by=language:en-us,type:content
12:39
Marcos
12:39 AM
with type:content the result is empty
12:40
Marcos
12:40 AM
without it, I get a few results
Jason
12:40 AM
Could you copy-as-curl that request, so I can run that curl command locally?
Marcos
12:42 AM
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
"http://.../collections/.../documents/search?q=to&query_by=title_tag,content&group_by=route_tag&include_fields=type,content,title_tag,anchor,hierarchy&highlight_fields=content&filter_by=language:en-us,type:content"
12:44
Marcos
12:44 AM
it's probably not working because type is not registered in the schema
Jason
12:46 AM
Right, that’s what I thought too initially. But it should have thrown an error if you try to filter on a field that doesn’t exist in the schema
12:47
Jason
12:47 AM
Looks like we’re missing a validation there.
12:47
Jason
12:47 AM
In any case, let me add type as a field to the schema in the scraper and publish a new version shortly
Marcos
12:47 AM
So, apparently there is a bug in the scraper (it doesn't include this field in the schema) and in the validation
12:48
Marcos
12:48 AM
> In any case, let me add type as a field to the schema in the scraper and publish a new version shortly
It'll help a lot!
Jason
12:57 AM
Could you update to typesense/docsearch-scraper:0.4.1 docker image and re-run the scraper?
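(For reference, a typical way to run the scraper with that image, sketched under the assumption that the config above is saved as docsearch.config.js and that a .env file provides the Typesense connection variables:)

# Run the Typesense DocSearch scraper against the config file
# .env is assumed to contain TYPESENSE_API_KEY, TYPESENSE_HOST, TYPESENSE_PORT and TYPESENSE_PROTOCOL
docker run -it --env-file=.env \
  -e "CONFIG=$(cat docsearch.config.js | jq -r tostring)" \
  typesense/docsearch-scraper:0.4.1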
Marcos
01:02 AM
sure
01:02
Marcos
01:02 AM
give a minute
01:02
Marcos
01:02 AM
*give me
01:07
Marcos
01:07 AM
it's now there!
[screenshot]
Jason
01:08 AM
Great, thank you for confirming.
01:08
Jason
01:08 AM
The filter_by syntax should be filter_by=language:en-us && type:content
01:08
Jason
01:08 AM
Use && instead of , between multiple conditions
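(A sketch of the earlier search request with the corrected filter; curl -G with --data-urlencode is used so the && is URL-encoded. Host, key and query values are placeholders:)

# Only return documents whose language is en-us AND whose type is content
curl -G -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "https://${TYPESENSE_HOST}/collections/croct-docs/documents/search" \
  --data-urlencode "q=to" \
  --data-urlencode "query_by=title_tag,content" \
  --data-urlencode "group_by=route_tag" \
  --data-urlencode "filter_by=language:en-us && type:content"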
Marcos
01:12 AM
it works perfectly, tks!


01:12
Marcos
01:12 AM
is there any way to limit the number of grouped items?
01:12
Marcos
01:12 AM
I mean, not the number of groups, but the max number of items per group
Jason
01:12 AM
group_limit is the search param you can use to control that
Marcos
01:13 AM
got it
01:13
Marcos
01:13 AM
per_page = max number of groups
group_limit = max number of entries per group
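(A sketch combining both parameters, with illustrative values; host and key are placeholders:)

# At most 15 groups per page, and at most 3 hits inside each group
curl -G -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "https://${TYPESENSE_HOST}/collections/croct-docs/documents/search" \
  --data-urlencode "q=to" \
  --data-urlencode "query_by=title_tag,content" \
  --data-urlencode "group_by=route_tag" \
  --data-urlencode "per_page=15" \
  --data-urlencode "group_limit=3"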
Jason
01:13 AM
Yup
01:14
Jason
01:14 AM
Btw, are you building your own search UI?
01:14
Jason
01:14 AM
If so, out of curiosity, any reason you’re not using the docsearch UI library?
Marcos
01:21 AM
Yep
01:21
Marcos
01:21 AM
we have a design system
01:21
Marcos
01:21 AM
so we need to stick to it
Jason
01:22 AM
I see, good to know!
Marcos
01:23 AM
[screenshot]
Jason
01:23 AM
Nice!
01:24
Jason
01:24 AM
Btw, you want to add sort_by=_text_match:desc,item_priority:desc as an additional query parameter
Marcos
01:27 AM
thanks for the tip!
01:28
Marcos
01:28 AM
I assumed it would sort by relevance by default
01:29
Marcos
01:29 AM
Btw, in this case, shouldn't the second result come before the first, since its match is at the beginning?
[screenshot]
Jason
01:42 AM
https://typesense-community.slack.com/archives/C01P749MET0/p1680053290080709?thread_ts=1680045323.244179&cid=C01P749MET0

You’re right. I was thinking of something else. Scratch what I said earlier. You shouldn’t have to specify it explicitly, since that’s the default behavior.
Marcos
02:13 AM
but my question remains
Jason
02:45 AM
We don’t use word position for ranking by default. You want to set prioritize_token_position=true for that
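(A sketch of the same search with word position taken into account for ranking; host, key and query values are placeholders:)

# Rank hits whose matching tokens occur earlier in the field higher
curl -G -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  "https://${TYPESENSE_HOST}/collections/croct-docs/documents/search" \
  --data-urlencode "q=to" \
  --data-urlencode "query_by=title_tag,content" \
  --data-urlencode "prioritize_token_position=true"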
