#community-help

Troubleshooting Local Scraper & Sitemap Issues

TLDR Rubai experienced issues with a local scraper and sitemap URLs not working. Jason instructed them to use meta tags and adjust their config file, which resolved the issues.

Solved
Mar 13, 2023 (7 months ago)
Rubai
07:04 AM
Hi, I am trying to run the scraper locally. I tried with ngrok and with port 80, but it is not scraping my docs. When I put <https://example.com> instead of <http://host.docker.internal/>.., it worked perfectly.
Can anyone please help me out?
Here is the config file:
  {
    "index_name": "payment-page",
    "js_render": true,
    "js_wait": 10,
    "use_anchors": false,
    "user_agent": "Custom Bot",
    "start_urls": [
      ""
    ],
    "sitemap_alternate_links": false,
    "selectors": {
      "lvl0":"h1,  h2 , .heading-text" ,
      "lvl1": "h3, .label" ,
      "lvl2": ".key-header, .step-card-header-text, .th-row",
      "text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
    },
    "strip_chars": " .,;:#",
    "scrap_start_urls": true,
    "custom_settings": {
      "synonyms": {
        "relevancy": ["relevant", "relevance"],
        "relevant": ["relevancy", "relevance"],
        "relevance": ["relevancy", "relevant"]
      }
    }
  }
  

Output: > DocSearch: http://host.docker.internal/payment-page/android/base-sdk-integration/session 0 records)
Rubai
10:14 AM
Kishore Nallan, Jason, please help me out.
Kishore Nallan
10:20 AM
Jason will be able to help when he is online later.

Jason
06:23 PM
What link do you use to access your documentation site from your browser?
Mar 14, 2023 (7 months ago)
Rubai
06:05 AM
This issue is resolved. Is Typesense able to scrape all the links if the config has multiple indexes?

Example config:
[
 {
    "index_name": "payment-page",
    "js_render": true,
    "js_wait": 5,
    "use_anchors": false,
    "user_agent": "Custom Bot",
    "start_urls": [
      "",
      "",
      "",
      "",
      "",
      "",
      "",
      "",
      "",
      "",
      "",
      ""
    ],
    "sitemap_alternate_links": false,
    "selectors": {
      "lvl0":"h1,  h2 , .heading-text" ,
      "lvl1": "h3, .label" ,
      "lvl2": ".key-header, .step-card-header-text, .th-row",
      "text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
    },
    "strip_chars": " .,;:#",
    "scrap_start_urls": true,
    "custom_settings": {
      "synonyms": {
        "relevancy": ["relevant", "relevance"],
        "relevant": ["relevancy", "relevance"],
        "relevance": ["relevancy", "relevant"]
      }
    }
  },
  {
    "index_name": "payment-page2",
    "js_render": true,
    "js_wait": 5,
    "use_anchors": false,
    "user_agent": "Custom Bot",
    "start_urls": [
      "",
      "",
      ""
    ],
    "sitemap_alternate_links": false,
    "selectors": {
      "lvl0":"h1,  h2 , .heading-text" ,
      "lvl1": "h3, .label" ,
      "lvl2": ".key-header, .step-card-header-text, .th-row",
      "text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
    },
    "strip_chars": " .,;:#",
    "scrap_start_urls": true,
    "custom_settings": {
      "synonyms": {
        "relevancy": ["relevant", "relevance"],
        "relevant": ["relevancy", "relevance"],
        "relevance": ["relevancy", "relevant"]
      }
    }
  }
]

  
Jason
02:03 PM
Yup shouldn’t be a problem
Rubai
02:26 PM
Can you please guide me on how to do it in a single config file? Previously I tried, but only the last index's URLs were scraped.
Jason
03:33 PM
Oh I see what you’re trying to do…

It can’t be an array of configs like that
Jason
03:34 PM
You’d have to include all start_urls in the same (single) config object
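A minimal sketch of that advice, assuming the separate configs share the same selectors and settings. `mergeStartUrls` is a hypothetical helper, not part of the scraper, and the URLs are placeholders rather than the ones from the thread. It folds the `start_urls` of several config objects into one object that can be saved as the single config file.

```javascript
// Hypothetical helper: fold several scraper configs into one config object.
// Only `start_urls` is combined; every other setting is taken from the first config.
function mergeStartUrls(configs) {
  if (configs.length === 0) {
    throw new Error("expected at least one config");
  }
  const [first] = configs;
  const seen = new Set(); // drops duplicate URLs while keeping insertion order
  for (const config of configs) {
    for (const url of config.start_urls || []) {
      seen.add(url);
    }
  }
  return { ...first, start_urls: [...seen] };
}

// Placeholder URLs, not the ones used in the thread.
const merged = mergeStartUrls([
  { index_name: "docs", start_urls: ["https://example.com/payment-page/"] },
  { index_name: "docs", start_urls: ["https://example.com/in-app-upi/"] },
]);

console.log(JSON.stringify(merged, null, 2));
```

Since everything ends up in one index, separating products at search time needs a different mechanism than separate index names.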
Rubai
06:31 PM
Then how do we restrict the URL hits that we don't want to show in the search results?
Rubai
06:49 PM
But stop URLs stop the scraping of those URLs.
Actually, our docs cover more than one product, and we want a search within one product to show results for that product only.
Rubai
06:50 PM
Like in my config, payment-page and payment-page2 are two different products.
Jason
11:39 PM
I’m not sure I understand your use-case… Could you rephrase?
Mar 15, 2023 (7 months ago)
Rubai
05:57 AM
https://docs.juspay.in/ You can see at the given link that there are two products, payment page and in-app-upi.
So we want to filter the search results so that one product's search hits do not show up in the other product's search results.

So how would we be able to do that in a single config file?
Jason
05:35 PM
Ah I see… You want to add a meta HTML tag to each of your docs pages like this: https://typesense.org/docs/guide/docsearch.html#add-docsearch-meta-tags-optional

So in your payment page docs pages, you’d add a meta tag to all pages called say:

<meta name="docsearch:product_tag" content="payment_page" />

In your in-app-upi docs pages, you’d add a meta tag to all pages called say:

<meta name="docsearch:product_tag" content="in_app_upi" />

Then on the front-end, depending on the product docs the user is visiting right now, you would pass in a filter_by parameter in the docsearch.js config like this:

https://typesense.org/docs/guide/docsearch.html#option-c-custom-docs-framework-with-docsearch-js-v3-modal-layout

See typesenseSearchParameters.filter_by.

In your case you’d set that like this:

docsearch({
    container: '#searchbar',
    typesenseCollectionName: 'docs', 
    typesenseServerConfig: { ... },
    typesenseSearchParameters: { 
      filter_by: 'product:=in_app_upi' 
    },
  });
Mar 16, 2023 (7 months ago)
Rubai
06:25 AM
So you are saying that we would have to put all the URLs of our docs (every product's URLs) in a single config file, and add a different meta tag on the pages of each product?
Rubai
10:39 AM
.docsearch({
      container: '#searchbar',
      typesenseCollectionName: "Developer_Docs", 
      typesenseServerConfig: { 
        nodes: [{
          host: 'localhost', 
          port: '8108',      
          protocol: 'http'  
        }],
        apiKey: 'xyz', 
      },
      typesenseSearchParameters: { 
      filter_by: `product:=${documentationJSON.documentation.productId}` 
    },
    });

Here is my docsearch.js config. It doesn't work for me; it fetches 0 results in the search bar.
Can you please help me figure out what I am doing wrong here?

Here is my meta tag:
<meta name="docsearch:product_tag" content="{$page.params.products}" />
Jason
04:22 PM
Could you open the network inspector in your browser’s dev console, then do a search in the search bar and then right-click on the network request to multi_search, copy-as-curl and paste that here?
Jason
04:22 PM
Separately, could you do a GET /collections against your typesense node and post the output of that?
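For reference, a rough sketch of that `GET /collections` call from JavaScript. The host, port and API key are the values that appear in the docsearch.js config earlier in the thread. `collectionsRequest` and `summarizeCollections` are hypothetical helper names, and the sample response is made up for illustration, not the actual output from this node. Listing the field names of each collection makes it easy to check which field a `filter_by` expression should target.

```javascript
// Build the request for listing all collections on a Typesense node.
// Typesense reads the API key from the X-TYPESENSE-API-KEY header.
function collectionsRequest(node, apiKey) {
  return {
    url: `${node.protocol}://${node.host}:${node.port}/collections`,
    options: { method: "GET", headers: { "X-TYPESENSE-API-KEY": apiKey } },
  };
}

// Reduce the response (an array of collection schemas) to collection and field names.
function summarizeCollections(collections) {
  return collections.map((c) => ({
    name: c.name,
    fields: (c.fields || []).map((f) => f.name),
  }));
}

const request = collectionsRequest(
  { protocol: "http", host: "localhost", port: "8108" },
  "xyz"
);

// Made-up sample response, only to show the shape being summarized.
const sample = [
  {
    name: "Developer_Docs",
    fields: [
      { name: "content", type: "string" },
      { name: "url", type: "string" },
    ],
  },
];

console.log(request.url);
console.log(JSON.stringify(summarizeCollections(sample)));
```

Against a live node (Node 18+ or a browser), the request could be sent with `fetch(request.url, request.options).then((r) => r.json())` and the result passed to `summarizeCollections`.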
Rubai
08:22 PM
curl 'http://localhost:8108/multi_search?x-typesense-api-key=xyz' \
-H 'Accept: application/json, text/plain, */*' \
-H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
-H 'Cache-Control: no-cache' \
-H 'Connection: keep-alive' \
-H 'Content-Type: text/plain' \
-H 'Origin: http://localhost:3000' \
-H 'Pragma: no-cache' \
-H 'Referer: http://localhost:3000/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-site' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
-H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "macOS"' \
--data-raw '{"searches":[{"collection":"Developer_Docs","q":"session ","query_by":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","include_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,anchor,url,type,id","highlight_full_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","group_by":"url","group_limit":3,"filter_by":"product:=payment-page_android"}]}' \
--compressed
Rubai
08:31 PM
Jason
08:35 PM
Could you change

filter_by: `product:=${documentationJSON.documentation.productId}` 

to

filter_by: `product_tag:=${documentationJSON.documentation.productId}` 

and try again?
Jason
08:36 PM
If that also doesn’t work, could you get one sample document from the Typesense Collection and post the JSON here? I want to see the actual value for the field in the records
Rubai
08:41 PM
Thanks it's working now
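The fix suggests that the field to filter on is the part of the meta tag name after the `docsearch:` prefix. A small sketch under that assumption (the helper names are hypothetical, and the tag value is the example one from earlier in the thread):

```javascript
// Assumption: a <meta name="docsearch:product_tag" ...> tag is indexed as a
// field called "product_tag", i.e. the name with the "docsearch:" prefix removed.
function fieldFromMetaName(metaName) {
  const prefix = "docsearch:";
  if (!metaName.startsWith(prefix)) {
    throw new Error(`not a docsearch meta tag: ${metaName}`);
  }
  return metaName.slice(prefix.length);
}

// Build an exact-match filter_by expression for typesenseSearchParameters.
function buildFilterBy(metaName, value) {
  return `${fieldFromMetaName(metaName)}:=${value}`;
}

const filterBy = buildFilterBy("docsearch:product_tag", "in_app_upi");
console.log(filterBy); // product_tag:=in_app_upi
```

The value has to match the meta tag's `content` exactly, so it is worth checking what the page template actually renders into that attribute.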

Mar 17, 2023 (7 months ago)
Rubai
06:49 AM
Would we be able to do partial scraping with our config? And can you please help me figure out why the sitemap.xml URLs are not scraped?
Config:
  {
    "index_name": "Developer_Docs",
    "js_render": true,
    "js_wait": 5,
    "use_anchors": false,
    "user_agent": "Custom Bot",
    "start_urls": [
      "",
      "",
      ""
    ],
    "sitemap_urls": [
      ""
    ],
    "sitemap_alternate_links": true,
    "selectors": {
      "lvl0":"h1,h2,[data-search-class='lvl0']",
      "lvl1":"h3,[data-search-class='lvl1']",
      "lvl2":"[data-search-class='lvl2']",
      "text":"p:not(:empty),[data-search-class='text']"
    },
    "strip_chars": " .,;:#",
    "scrap_start_urls": true,
    "custom_settings": {
      "synonyms": {
        "relevancy": ["relevant", "relevance"],
        "relevant": ["relevancy", "relevance"],
        "relevance": ["relevancy", "relevant"]
      }
    }
  }

  
Jason
04:36 PM
Could you elaborate on what you mean by “partial scraping”?
Rubai
06:45 PM
Sure. We have a call tonight from 1 PM, if you are available. I want to clear up some doubts and discuss some features and issues.
Rubai
06:46 PM
Can you please help with why sitemap_urls is not working for me?
Mar 18, 2023 (7 months ago)
Rubai
08:52 PM
Hey Jason, I am trying with the live sitemap.xml link but it still won't work (we discussed it on a call yesterday). Can you please help me figure out why this is happening?
Config:
  {
    "index_name": "Developer_Docs",
    "js_render": true,
    "js_wait": 5,
    "use_anchors": false,
    "user_agent": "Custom Bot",
    "start_urls": [
      
      ""
    ],
    "sitemap_urls": [
      ""
    ],
    "sitemap_alternate_links": true,
    "selectors": {
      "lvl0":"h1,h2,[data-search-class='lvl0']",
      "lvl1":"h3,[data-search-class='lvl1']",
      "lvl2":"[data-search-class='lvl2']",
      "text":"p,[data-search-class='text']"
    },
    "strip_chars": " .,;:#",
    "scrap_start_urls": true,
    "custom_settings": {
      "synonyms": {
        "relevancy": ["relevant", "relevance"],
        "relevant": ["relevancy", "relevance"],
        "relevance": ["relevancy", "relevant"]
      }
    }
  }

   
Jason
09:01 PM
The crawler will only crawl links that start with the urls mentioned in start_urls, regardless of it being mentioned in the sitemap or not.

So for eg, in the above config, it will only crawl pages like this: <https://docs.juspay.in/payment-page/android/overview/integration-architecture/*>
Jason
09:02 PM
So you want to specify <https://docs.juspay.in/payment-page> in the start_urls since that seems to be your base url?
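To illustrate the rule Jason describes, here is a simplified model, assuming a plain prefix check (the real scraper's matching may differ). `crawlableUrls` is a hypothetical function, and the sitemap entries are illustrative paths under the site from the thread, not its actual sitemap.

```javascript
// Simplified model: a sitemap URL is crawled only if it starts with one of the
// start_urls. This is an assumption for illustration, not the scraper's code.
function crawlableUrls(sitemapUrls, startUrls) {
  return sitemapUrls.filter((url) =>
    startUrls.some((start) => url.startsWith(start))
  );
}

// Illustrative sitemap entries (hypothetical).
const sitemapUrls = [
  "https://docs.juspay.in/payment-page/android/overview/integration-architecture/",
  "https://docs.juspay.in/payment-page/android/base-sdk-integration/session",
  "https://docs.juspay.in/in-app-upi/",
];

// A deep start URL lets through only the pages underneath it.
const narrow = crawlableUrls(sitemapUrls, [
  "https://docs.juspay.in/payment-page/android/overview/integration-architecture/",
]);

// The base URL lets through every payment-page entry in the sitemap.
const broad = crawlableUrls(sitemapUrls, ["https://docs.juspay.in/payment-page"]);

console.log(narrow.length, broad.length);
```

Under this model, the in-app-upi entry would need its own entry in `start_urls` to be crawled.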
Rubai
09:07 PM
It won't work for me. Can you please check once whether there is any problem with my sitemap.xml?
Jason
09:08 PM
Your sitemap looks fine. I wonder if it also expects the sitemap to be in the same location as the start_urls…
Rubai
09:10 PM
Maybe that is the case.
Rubai
09:10 PM
DEBUG:scrapy.dupefilters:Filtered duplicate request: <GET > - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

I got this