# community-help
r
hi, I am trying to run the scraper locally. I tried with ngrok on port 80, but it's not scraping my docs. But when I put https://example.com instead of http://host.docker.internal/, it worked perfectly. Can anyone please help me out? Here is the config file:
Copy code
{
  "index_name": "payment-page",
  "js_render": true,
  "js_wait": 10,
  "use_anchors": false,
  "user_agent": "Custom Bot",
  "start_urls": [
    "<http://host.docker.internal/payment-page/android/base-sdk-integration/session>"
  ],
  "sitemap_alternate_links": false,
  "selectors": {
    "lvl0": "h1,  h2 , .heading-text",
    "lvl1": "h3, .label",
    "lvl2": ".key-header, .step-card-header-text, .th-row",
    "text": ".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": [
        "relevant",
        "relevance"
      ],
      "relevant": [
        "relevancy",
        "relevance"
      ],
      "relevance": [
        "relevancy",
        "relevant"
      ]
    }
  }
}
Output: > DocSearch: http://host.docker.internal/payment-page/android/base-sdk-integration/session 0 records)
@Kishore Nallan @Jason Bosco please help me out
k
Jason will be able to help when he is online later.
👍 1
j
What link do you use to access your documentation site from your browser?
r
This issue is resolved. Is Typesense able to scrape all the links if the config has multiple indices? Example config:
Copy code
[
 {
    "index_name": "payment-page",
    "js_render": true,
    "js_wait": 5,
    "use_anchors": false,
    "user_agent": "Custom Bot",
    "start_urls": [
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/integration-architecture>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/pre-requisites>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/session>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/order-status-api>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/getting-sdk>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/initiating-sdk>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/processing-sdk>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/handle-payment-response>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/life-cycle-events>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/error-codes>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/transaction-status>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/sample-payloads>"
    ],
    "sitemap_alternate_links": false,
    "selectors": {
      "lvl0":"h1,  h2 , .heading-text" ,
      "lvl1": "h3, .label" ,
      "lvl2": ".key-header, .step-card-header-text, .th-row",
      "text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
    },
    "strip_chars": " .,;:#",
    "scrap_start_urls": true,
    "custom_settings": {
      "synonyms": {
        "relevancy": ["relevant", "relevance"],
        "relevant": ["relevancy", "relevance"],
        "relevance": ["relevancy", "relevant"]
      }
    }
  },
  {
    "index_name": "payment-page2",
    "js_render": true,
    "js_wait": 5,
    "use_anchors": false,
    "user_agent": "Custom Bot",
    "start_urls": [
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/integration-architecture>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/transaction-status>",
      "<https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/sample-payloads>"
    ],
    "sitemap_alternate_links": false,
    "selectors": {
      "lvl0":"h1,  h2 , .heading-text" ,
      "lvl1": "h3, .label" ,
      "lvl2": ".key-header, .step-card-header-text, .th-row",
      "text":".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty),td"
    },
    "strip_chars": " .,;:#",
    "scrap_start_urls": true,
    "custom_settings": {
      "synonyms": {
        "relevancy": ["relevant", "relevance"],
        "relevant": ["relevancy", "relevance"],
        "relevance": ["relevancy", "relevant"]
      }
    }
  }
]
j
Yup, shouldn’t be a problem.
r
Can you please guide me on how to do it in a single config file? Previously I tried, but only the last index's URLs were scraped.
j
Oh I see what you’re trying to do… It can’t be an array of configs like that
You’d have to include all start_urls in the same (single) config object
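Something like this (a trimmed sketch that merges your two example configs above into a single object, listing every URL from both lists in start_urls — only three shown here; selectors, strip_chars and custom_settings omitted for brevity):
Copy code
{
  "index_name": "payment-page",
  "js_render": true,
  "js_wait": 5,
  "use_anchors": false,
  "user_agent": "Custom Bot",
  "start_urls": [
    "https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/integration-architecture",
    "https://9491-103-159-11-202.in.ngrok.io/payment-page/android/overview/pre-requisites",
    "https://9491-103-159-11-202.in.ngrok.io/payment-page/android/resources/sample-payloads"
  ],
  "sitemap_alternate_links": false,
  "scrap_start_urls": true
}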
r
Then how do we restrict hits for the URLs that we don't want to show in search results?
j
r
But stop_urls stops the scraping for that URL. Actually, in my docs we have more than one product, and we want search to show results only for the product the user is currently in. Like in my config, payment-page and payment-page2 are two different products.
j
I’m not sure I understand your use-case… Could you rephrase?
r
https://docs.juspay.in/ — you can see in the given link there are two products, payment page and in-app-upi. We want to filter search results so that one product's hits don't come up in the other product's results. How would we be able to do that in a single config file?
j
Ah I see… You want to add a meta HTML tag to each of your docs pages like this: https://typesense.org/docs/guide/docsearch.html#add-docsearch-meta-tags-optional
So in your payment page docs pages, you'd add a meta tag to all pages called, say:
Copy code
<meta name="docsearch:product_tag" content="payment_page" />
In your in-app-upi docs pages, you'd add a meta tag to all pages called, say:
Copy code
<meta name="docsearch:product_tag" content="in_app_upi" />
Then on the front-end, depending on the product docs the user is visiting right now, you would pass a filter_by parameter in the docsearch.js config like this: https://typesense.org/docs/guide/docsearch.html#option-c-custom-docs-framework-with-docsearch-js-v3-modal-layout
See typesenseSearchParameters.filter_by. In your case you'd set that like this:
Copy code
docsearch({
    container: '#searchbar',
    typesenseCollectionName: 'docs', 
    typesenseServerConfig: { ... },
    typesenseSearchParameters: { 
      filter_by: 'product:=in_app_upi' 
    },
  });
r
So you're saying that we would have to put all the URLs of our docs in a single config file (every product's URLs), and add a different meta tag on every page for each product.
Copy code
docsearch({
  container: '#searchbar',
  typesenseCollectionName: "Developer_Docs",
  typesenseServerConfig: {
    nodes: [{
      host: 'localhost',
      port: '8108',
      protocol: 'http'
    }],
    apiKey: 'xyz',
  },
  typesenseSearchParameters: {
    filter_by: `product:=${documentationJSON.documentation.productId}`
  },
});
Here is my docsearch.js config. It won't work for me; it fetches 0 results in the search bar. Can you please help me figure out what I am doing wrong here? Here is my meta tag:
Copy code
<meta name="docsearch:product_tag" content="{$page.params.products}" />
j
Could you open the network inspector in your browser’s dev console, then do a search in the search bar and then right-click on the network request to multi_search, copy-as-curl and paste that here?
Separately, could you do a GET /collections against your Typesense node and post the output of that?
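For example, something like this (assuming the same local node and API key as in your docsearch.js config above):
Copy code
curl -H 'X-TYPESENSE-API-KEY: xyz' 'http://localhost:8108/collections'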
r
curl 'http://localhost:8108/multi_search?x-typesense-api-key=xyz' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Content-Type: text/plain' \
  -H 'Origin: http://localhost:3000' \
  -H 'Pragma: no-cache' \
  -H 'Referer: http://localhost:3000/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-site' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' \
  -H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "macOS"' \
  --data-raw '{"searches":[{"collection":"Developer_Docs","q":"session ","query_by":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","include_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,anchor,url,type,id","highlight_full_fields":"hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content","group_by":"url","group_limit":3,"filter_by":"product:=payment-page_android"}]}' \
  --compressed
collections.txt
j
Could you change
Copy code
filter_by: `product:=${documentationJSON.documentation.productId}`
to
Copy code
filter_by: `product_tag:=${documentationJSON.documentation.productId}`
and try again?
If that also doesn’t work, could you get one sample document from the Typesense Collection and post the JSON here? I want to see the actual value for the field in the records
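For reference, each scraped record should look roughly like this (a hypothetical sketch with illustrative values; the key thing to verify is that the meta tag shows up as a top-level product_tag field whose value matches what you pass to filter_by):
Copy code
{
  "hierarchy.lvl0": "Base SDK Integration",
  "hierarchy.lvl1": "Session",
  "content": "Create a session before initiating the SDK.",
  "anchor": "session",
  "url": "http://localhost:3000/payment-page/android/base-sdk-integration/session",
  "type": "content",
  "product_tag": "payment-page_android"
}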
r
Thanks, it's working now!
🙌 1
👍 1
Would we be able to do partial scraping in our config? And can you please help me out with why the sitemap.xml URLs are not scraped? Config:
Copy code
{
  "index_name": "Developer_Docs",
  "js_render": true,
  "js_wait": 5,
  "use_anchors": false,
  "user_agent": "Custom Bot",
  "start_urls": [
    "<https://62bc-119-82-122-182.in.ngrok.io/payment-page/ios/overview/integration-architecture>",
    "<https://62bc-119-82-122-182.in.ngrok.io/payment-page/ios/overview/pre-requisites>",
    "<https://62bc-119-82-122-182.in.ngrok.io/payment-page/ios/resources/error-codes>"
  ],
  "sitemap_urls": [
    "<https://62bc-119-82-122-182.in.ngrok.io/payment-page/android/sitemap.xml>"
  ],
  "sitemap_alternate_links": true,
  "selectors": {
    "lvl0": "h1,h2,[data-search-class='lvl0']",
    "lvl1": "h3,[data-search-class='lvl1']",
    "lvl2": "[data-search-class='lvl2']",
    "text": "p:not(:empty),[data-search-class='text']"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": [
        "relevant",
        "relevance"
      ],
      "relevant": [
        "relevancy",
        "relevance"
      ],
      "relevance": [
        "relevancy",
        "relevant"
      ]
    }
  }
}
j
Could you elaborate on what you mean by “partial scraping”?
r
Sure. We have a call tonight from 1 PM, if you are available; I want to clear some doubts and discuss some features and issues.
Can you please help with why sitemap_urls is not working for me?
Hey @Jason Bosco, I am trying with a live sitemap.xml link but it still won't work (we discussed it on a call yesterday). Can you please help me figure out why this is happening? Config:
Copy code
{
  "index_name": "Developer_Docs",
  "js_render": true,
  "js_wait": 5,
  "use_anchors": false,
  "user_agent": "Custom Bot",
  "start_urls": [
    "<https://docs.juspay.in/payment-page/android/overview/integration-architecture>"
  ],
  "sitemap_urls": [
    "<https://testing-chi-eight.vercel.app/sitemap.xml>"
  ],
  "sitemap_alternate_links": true,
  "selectors": {
    "lvl0": "h1,h2,[data-search-class='lvl0']",
    "lvl1": "h3,[data-search-class='lvl1']",
    "lvl2": "[data-search-class='lvl2']",
    "text": "p,[data-search-class='text']"
  },
  "strip_chars": " .,;:#",
  "scrap_start_urls": true,
  "custom_settings": {
    "synonyms": {
      "relevancy": [
        "relevant",
        "relevance"
      ],
      "relevant": [
        "relevancy",
        "relevance"
      ],
      "relevance": [
        "relevancy",
        "relevant"
      ]
    }
  }
}
j
The crawler will only crawl links that start with the URLs mentioned in start_urls, regardless of whether they're mentioned in the sitemap or not. So for eg, in the above config, it will only crawl pages like this:
https://docs.juspay.in/payment-page/android/overview/integration-architecture/*
So you'd want to specify https://docs.juspay.in/payment-page in the start_urls, since that seems to be your base URL.
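i.e. something like this, keeping the rest of your config unchanged:
Copy code
"start_urls": [
  "https://docs.juspay.in/payment-page"
]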
r
it's wont work for me . can you please check once is anything problem on my sitemap.xml
j
Your sitemap looks fine. I wonder if it also expects the sitemap to be in the same location as the start_urls…
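If that's the case, one thing to try would be serving the sitemap under the same base URL as the start_urls, e.g. (hypothetical path; assumes your site can serve a sitemap there):
Copy code
"start_urls": [
  "https://docs.juspay.in/payment-page"
],
"sitemap_urls": [
  "https://docs.juspay.in/payment-page/sitemap.xml"
]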
r
Maybe that is the case.
Copy code
DEBUG:scrapy.dupefilters:Filtered duplicate request: <GET https://docs.juspay.in/payment-page> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
I got this