# community-help
r
hi, I am trying to run the scraper locally. I tried with ngrok on port 80, but it's not scraping my docs. But when I put https://example.com instead of http://host.docker.internal/, it worked perfectly. Can anyone please help me out? Here is the config file:
Copy code
{
  "index_name": "payment-page",
  "js_render": true,
  "js_wait": 10,
  "use_anchors": false,
  "user_agent": "Custom Bot",
  "start_urls": [
    "<http://host.docker.internal/payment-page/android/base-sdk-integration/session>"
  ],
  "sitemap_alternate_links": false,
  "selectors": {
    "lvl0": "h1,  h2 , .heading-text",
    "lvl1": "h3, .label",
    "lvl2": ".key-header, .step-card-header-text, .th-row",
    "text": ".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": [
        "relevant",
        "relevance"
      ],
      "relevant": [
        "relevancy",
        "relevance"
      ],
      "relevance": [
        "relevancy",
        "relevant"
      ]
    }
  }
}
Output: > DocSearch: http://host.docker.internal/payment-page/android/base-sdk-integration/session 0 records)
@Kishore Nallan @Jason Bosco please help me out
k
Jason will be able to help when he is online later.
👍 1
j
What link do you use to access your documentation site from your browser?
r
This issue is resolved. Is Typesense able to scrape all the links if the config has multiple indices? Example config:
Copy code
[
 {
    "index_name": "payment-page",
    "js_render": true,
    "js_wait": 5,
    "use_anchors": false,
    "user_agent": "Custom Bot",
    "start_urls": [
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/integration-architecture>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/pre-requisites>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/session>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/order-status-api>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/getting-sdk>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/initiating-sdk>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/processing-sdk>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/handle-payment-response>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/life-cycle-events>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/error-codes>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/transaction-status>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/sample-payloads>"
    ],
    "sitemap_alternate_links": false,
    "selectors": {
      "lvl0":"h1,  h2 , .heading-text" ,
      "lvl1": "h3, .label" ,
      "lvl2": ".key-header, .step-card-header-text, .th-row",
      "text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
    },
    "strip_chars": " .,;:#",
    "scrap_start_urls": true,
    "custom_settings": {
      "synonyms": {
        "relevancy": ["relevant", "relevance"],
        "relevant": ["relevancy", "relevance"],
        "relevance": ["relevancy", "relevant"]
      }
    }
  },
  {
    "index_name": "payment-page2",
    "js_render": true,
    "js_wait": 5,
    "use_anchors": false,
    "user_agent": "Custom Bot",
    "start_urls": [
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/integration-architecture>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/transaction-status>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/sample-payloads>"
    ],
    "sitemap_alternate_links": false,
    "selectors": {
      "lvl0":"h1,  h2 , .heading-text" ,
      "lvl1": "h3, .label" ,
      "lvl2": ".key-header, .step-card-header-text, .th-row",
      "text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
    },
    "strip_chars": " .,;:#",
    "scrap_start_urls": true,
    "custom_settings": {
      "synonyms": {
        "relevancy": ["relevant", "relevance"],
        "relevant": ["relevancy", "relevance"],
        "relevance": ["relevancy", "relevant"]
      }
    }
  }
]
j
Yup, shouldn’t be a problem.
r
Can you please guide me on how to do it in a single config file? Previously I tried, but only the last index's URLs were scraped.
j
Oh I see what you’re trying to do… It can’t be an array of configs like that
You’d have to include all start_urls in the same (single) config object
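Something like this (a trimmed sketch that merges your two example configs above into a single object, listing every URL from both lists in start_urls — only three shown here; selectors, strip_chars and custom_settings omitted for brevity):
Copy code
{
  "index_name": "payment-page",
  "js_render": true,
  "js_wait": 5,
  "use_anchors": false,
  "user_agent": "Custom Bot",
  "start_urls": [
    "https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/integration-architecture",
    "https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/pre-requisites",
    "https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/sample-payloads"
  ],
  "sitemap_alternate_links": false,
  "scrap_start_urls": true
}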
r
Then how do we restrict hits for the URLs that we don't want to show in search results?
j
r
But stop_urls stops the scraping for that URL. Actually, in my docs we have more than one product, and we want search to show results only for the product the user is currently in. Like in my config, payment-page and payment-page2 are two different products.
j
I’m not sure I understand your use-case… Could you rephrase?
r
https://docs.juspay.in/ — you can see in the given link there are two products, payment page and in-app-upi. We want to filter search results so that one product's hits don't come up in the other product's results. How would we be able to do that in a single config file?
j
Ah I see… You want to add a meta HTML tag to each of your docs pages like this: https://typesense.org/docs/guide/docsearch.html#add-docsearch-meta-tags-optional
So in your payment page docs pages, you'd add a meta tag to all pages called, say:
Copy code
<meta name="docsearch:product_tag" content="payment_page" />
In your in-app-upi docs pages, you'd add a meta tag to all pages called, say:
Copy code
<meta name="docsearch:product_tag" content="in_app_upi" />
Then on the front-end, depending on the product docs the user is visiting right now, you would pass a filter_by parameter in the docsearch.js config like this: https://typesense.org/docs/guide/docsearch.html#option-c-custom-docs-framework-with-docsearch-js-v3-modal-layout
See typesenseSearchParameters.filter_by. In your case you'd set that like this:
Copy code
docsearch({
    container: '#searchbar',
    typesenseCollectionName: 'docs', 
    typesenseServerConfig: { ... },
    typesenseSearchParameters: { 
      filter_by: 'product:=in_app_upi' 
    },
  });
r
So you're saying that we would have to put all the URLs of our docs in a single config file (every product's URLs), and add a different meta tag on every page for each product.
Copy code
docsearch({
  container: '#searchbar',
  typesenseCollectionName: "Developer_Docs",
  typesenseServerConfig: {
    nodes: [{
      host: 'localhost',
      port: '8108',
      protocol: 'http'
    }],
    apiKey: 'xyz',
  },
  typesenseSearchParameters: {
    filter_by: `product:=${documentationJSON.documentation.productId}`
  },
});
Here is my docsearch.js config. It won't work for me; it fetches 0 results in the search bar. Can you please help me figure out what I am doing wrong here? Here is my meta tag:
Copy code
<meta name="docsearch:product_tag" content="{$page.params.products}" />
j
Could you open the network inspector in your browser’s dev console, then do a search in the search bar and then right-click on the network request to multi_search, copy-as-curl and paste that here?
Separately, could you do a GET /collections against your Typesense node and post the output of that?
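For example, something like this (assuming the same local node and API key as in your docsearch.js config above):
Copy code
curl -H 'X-TYPESENSE-API-KEY: xyz' 'http://localhost:8108/collections'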
r
curl 'http://localhost:8108/multi_search?x-typesense-api-key=xyz' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Content-Type: text/plain' \
  -H 'Origin: http://localhost:3000' \
  -H 'Pragma: no-cache' \
  -H 'Referer: http://localhost:3000/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-site' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
  -H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  --data-raw '{"searches":[{"collection":"Developer_Docs","q":"session ","query_by":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","include_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,anchor,url,type,id","highlight_full_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","group_by":"url","group_limit":3,"filter_by":"product:=payment-page_android"}]}' \
  --compressed
collections.txt
j
Could you change
Copy code
filter_by: `product:=${documentationJSON.documentation.productId}`
to
Copy code
filter_by: `product_tag:=${documentationJSON.documentation.productId}`
and try again?
If that also doesn’t work, could you get one sample document from the Typesense Collection and post the JSON here? I want to see the actual value for the field in the records
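For reference, each scraped record should look roughly like this (a hypothetical sketch with illustrative values; the key thing to verify is that the meta tag shows up as a top-level product_tag field whose value matches what you pass to filter_by):
Copy code
{
  "hierarchy.lvl0": "Base SDK Integration",
  "hierarchy.lvl1": "Session",
  "content": "Create a session before initiating the SDK.",
  "anchor": "session",
  "url": "http://localhost:3000/payment-page/android/base-sdk-integration/session",
  "type": "content",
  "product_tag": "payment-page_android"
}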
r
Thanks, it's working now!
🙌 1
👍 1
Would we be able to do partial scraping in our config? And can you please help me out with why the sitemap.xml URLs are not scraped? Config:
Copy code
{
  "index_name": "Developer_Docs",
  "js_render": true,
  "js_wait": 5,
  "use_anchors": false,
  "user_agent": "Custom Bot",
  "start_urls": [
    "<https://62bc-119-82-122-182.in.ngrok.io/payment-page/ios/overview/integration-architecture>",
    "<https://62bc-119-82-122-182.in.ngrok.io/payment-page/ios/overview/pre-requisites>",
    "<https://62bc-119-82-122-182.in.ngrok.io/payment-page/ios/resources/error-codes>"
  ],
  "sitemap_urls": [
    "<https://62bc-119-82-122-182.in.ngrok.io/payment-page/android/sitemap.xml>"
  ],
  "sitemap_alternate_links": true,
  "selectors": {
    "lvl0": "h1,h2,[data-search-class='lvl0']",
    "lvl1": "h3,[data-search-class='lvl1']",
    "lvl2": "[data-search-class='lvl2']",
    "text": "p:not(:empty),[data-search-class='text']"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": [
        "relevant",
        "relevance"
      ],
      "relevant": [
        "relevancy",
        "relevance"
      ],
      "relevance": [
        "relevancy",
        "relevant"
      ]
    }
  }
}
j
Could you elaborate on what you mean by “partial scraping”?
r
Sure. We have a call tonight from 1 PM, if you are available; I want to clear some doubts and discuss some features and issues.
Can you please help with why sitemap_urls is not working for me?
Hey @Jason Bosco, I am trying with a live sitemap.xml link but it still won't work (we discussed it on a call yesterday). Can you please help me figure out why this is happening? Config:
Copy code
{
  "index_name": "Developer_Docs",
  "js_render": true,
  "js_wait": 5,
  "use_anchors": false,
  "user_agent": "Custom Bot",
  "start_urls": [
    "<https://docs.juspay.in/payment-page/android/overview/integration-architecture>"
  ],
  "sitemap_urls": [
    "<https://testing-chi-eight.vercel.app/sitemap.xml>"
  ],
  "sitemap_alternate_links": true,
  "selectors": {
    "lvl0": "h1,h2,[data-search-class='lvl0']",
    "lvl1": "h3,[data-search-class='lvl1']",
    "lvl2": "[data-search-class='lvl2']",
    "text": "p,[data-search-class='text']"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": [
        "relevant",
        "relevance"
      ],
      "relevant": [
        "relevancy",
        "relevance"
      ],
      "relevance": [
        "relevancy",
        "relevant"
      ]
    }
  }
}
j
The crawler will only crawl links that start with the URLs mentioned in start_urls, regardless of whether they're mentioned in the sitemap or not. So for eg, in the above config, it will only crawl pages like this:
https://docs.juspay.in/payment-page/android/overview/integration-architecture/*
So you'd want to specify https://docs.juspay.in/payment-page in the start_urls, since that seems to be your base URL.
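i.e. something like this, keeping the rest of your config unchanged:
Copy code
"start_urls": [
  "https://docs.juspay.in/payment-page"
]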
r
it's wont work for me . can you please check once is anything problem on my sitemap.xml
j
Your sitemap looks fine. I wonder if it also expects the sitemap to be in the same location as the start_urls…
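If that's the case, one thing to try would be serving the sitemap under the same base URL as the start_urls, e.g. (hypothetical path; assumes your site can serve a sitemap there):
Copy code
"start_urls": [
  "https://docs.juspay.in/payment-page"
],
"sitemap_urls": [
  "https://docs.juspay.in/payment-page/sitemap.xml"
]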
r
Maybe that is the case.
Copy code
DEBUG:scrapy.dupefilters:Filtered duplicate request: <GET https://docs.juspay.in/payment-page> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
I got this