Rubai Mandal
12/24/2024, 10:04 AM
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'scraper.src.custom_dupefilter.CustomDupeFilter',
'LOG_ENABLED': '1',
'LOG_LEVEL': 'INFO',
'TELNETCONSOLE_ENABLED': False,
'USER_AGENT': 'Custom Bot'}
2024-12-24 14:42:11 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scraper.src.custom_dupefilter.CustomDupeFilter',
'LOG_ENABLED': '1',
'LOG_LEVEL': 'INFO',
'TELNETCONSOLE_ENABLED': False,
'USER_AGENT': 'Custom Bot'}
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scraper.src.custom_downloader_middleware.CustomDownloaderMiddleware']
2024-12-24 14:42:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scraper.src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-12-24 14:42:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
2024-12-24 14:42:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
2024-12-24 14:42:11 [scrapy.core.engine] INFO: Spider opened
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:33301/session/d7ac1c7caa02db9188604a2681570d27/window {}
DEBUG:urllib3.connectionpool:http://localhost:33301 "DELETE /session/d7ac1c7caa02db9188604a2681570d27/window HTTP/1.1" 200 12
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:33301/session/d7ac1c7caa02db9188604a2681570d27 {}
DEBUG:urllib3.connectionpool:http://localhost:33301 "DELETE /session/d7ac1c7caa02db9188604a2681570d27 HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
Getting sitemap links
Crawling config
Nb hits: 0
attaching my code
settings = {
    'LOG_ENABLED': '1',
    'LOG_LEVEL': 'INFO',
    'USER_AGENT': config.user_agent,
    'DOWNLOADER_MIDDLEWARES': {DOWNLOADER_MIDDLEWARES_PATH: 900},
    # Need to be > 600 to be after the redirectMiddleware
    'DUPEFILTER_USE_ANCHORS': config.use_anchors,
    # Use our custom dupefilter in order to be scheme agnostic regarding link provided
    'DUPEFILTER_CLASS': DUPEFILTER_CLASS_PATH,
    'DEFAULT_REQUEST_HEADERS': DEFAULT_REQUEST_HEADERS,
    'TELNETCONSOLE_ENABLED': False
}
print("Crawling config")
crawler = CrawlerRunner(settings)
crawler.crawl(DocumentationSpider,
              config=config,
              typesense_helper=typesense_helper,
              strategy=strategy)
await crawler.join().asFuture(loop=asyncio.get_event_loop())
# Kill browser if needed
BrowserHandler.destroy(config.driver)
if len(config.extra_records) > 0:
    typesense_helper.add_records(
        config.extra_records, "Extra records", False)
print("")
if DocumentationSpider.NB_INDEXED > 0:
    typesense_helper.commit_tmp_collection()
    print('Nb hits: {}'.format(DocumentationSpider.NB_INDEXED))
    config.update_nb_hits_value(DocumentationSpider.NB_INDEXED)
else:
    print('Crawling issue: nbHits 0 for ' + config.index_name)
    exit(EXIT_CODE_NO_RECORD)
print("")
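Each Scrapy message in the log output above appears twice: once as `LEVEL:name:message` (the default format of Python's `logging.basicConfig`) and once with a timestamp (Scrapy's default log format). One plausible explanation, offered as an assumption rather than a diagnosis, is that two handlers receive the same record. The sketch below reproduces that pattern with the standard library only; the `demo` logger name and the `demo_duplicate_output` function are hypothetical and not part of the scraper.

```python
import io
import logging


def demo_duplicate_output():
    """Emit one record through two handlers and return the resulting lines."""
    stream = io.StringIO()

    # Hypothetical parent logger standing in for whatever logger carries
    # both handlers in the real process.
    parent = logging.getLogger("demo")
    parent.setLevel(logging.INFO)
    parent.propagate = False  # keep the demo isolated from the root logger

    # Handler 1: the default format used by logging.basicConfig().
    basic = logging.StreamHandler(stream)
    basic.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

    # Handler 2: a timestamped format matching Scrapy's default LOG_FORMAT.
    stamped = logging.StreamHandler(stream)
    stamped.setFormatter(logging.Formatter(
        "%(asctime)s [%(name)s] %(levelname)s: %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))

    parent.addHandler(basic)
    parent.addHandler(stamped)
    try:
        # A child logger propagates its record to both handlers on the parent.
        logging.getLogger("demo.core.engine").info("Spider opened")
    finally:
        parent.removeHandler(basic)
        parent.removeHandler(stamped)
    return stream.getvalue().splitlines()
```

If this is what is happening, the doubled lines are cosmetic and unrelated to the missing hits; removing one of the two handlers would leave a single copy of each message.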
Jason Bosco
12/24/2024, 5:51 PM
Jason Bosco
12/24/2024, 5:52 PM
Jason Bosco
12/24/2024, 5:52 PM
Rubai Mandal
12/24/2024, 6:09 PM
start_urls of config.
Could it be an issue with the crawler?
crawler = CrawlerRunner(settings)
crawler.crawl(DocumentationSpider,
              config=config,
              typesense_helper=typesense_helper,
              strategy=strategy)
await crawler.join().asFuture(loop=asyncio.get_event_loop())
Rubai Mandal
12/24/2024, 7:07 PM
Modified config {"index_name": "Developer_Docs_blog", "js_render": true, "use_anchors": false, "user_agent": "Custom Bot", "start_urls": ["https://juspay.io/in/docs/blog/magento/overview/pre-requisites", "https://juspay.io/in/docs/blog/magento/base-sdk-integration/installation", "https://juspay.io/in/docs/blog/magento/base-sdk-integration/generating-rsa-key-pair", "https://juspay.io/in/docs/blog/magento/base-sdk-integration/magento-backend-setup", "https://juspay.io/in/docs/blog/magento/resources/useful-links", "https://juspay.io/in/docs/blog/magento/resources/transaction-status", "https://juspay.io/in/docs/blog/magento/resources/refunds", "https://juspay.io/in/docs/blog/woocommerce/overview/pre-requisites", "https://juspay.io/in/docs/blog/woocommerce/base-sdk-integration/installation", "https://juspay.io/in/docs/blog/woocommerce/base-sdk-integration/generating-rsa-key-pair", "https://juspay.io/in/docs/blog/woocommerce/base-sdk-integration/woocommerce-dashboard-setup", "https://juspay.io/in/docs/blog/woocommerce/resources/useful-links", "https://juspay.io/in/docs/blog/woocommerce/resources/transaction-status", "https://juspay.io/in/docs/blog/woocommerce/resources/refunds"], "selectors": {"lvl0": "[data-search-class='lvl0']", "lvl1": "[data-search-class='lvl1']", "lvl2": "[data-search-class='lvl2'],[data-search-class='lvl3']", "text": "[data-search-class='text']"}, "strip_chars": " .,;:#", "scrap_start_urls": true, "custom_settings": {}}
INFO:scrapy.addons:Enabled addons:
[]
2024-12-25 00:36:35 [scrapy.addons] INFO: Enabled addons:
[]
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2024-12-25 00:36:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'scraper.src.custom_dupefilter.CustomDupeFilter',
'LOG_ENABLED': '1',
'LOG_LEVEL': 'INFO',
'TELNETCONSOLE_ENABLED': False,
'USER_AGENT': 'Custom Bot'}
2024-12-25 00:36:35 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scraper.src.custom_dupefilter.CustomDupeFilter',
'LOG_ENABLED': '1',
'LOG_LEVEL': 'INFO',
'TELNETCONSOLE_ENABLED': False,
'USER_AGENT': 'Custom Bot'}
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scraper.src.custom_downloader_middleware.CustomDownloaderMiddleware']
2024-12-25 00:36:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scraper.src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-12-25 00:36:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
2024-12-25 00:36:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
2024-12-25 00:36:35 [scrapy.core.engine] INFO: Spider opened
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:44157/session/804708fe4264d7931c62dddf894bbe46/window {}
DEBUG:urllib3.connectionpool:http://localhost:44157 "DELETE /session/804708fe4264d7931c62dddf894bbe46/window HTTP/1.1" 200 12
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:44157/session/804708fe4264d7931c62dddf894bbe46 {}
DEBUG:urllib3.connectionpool:http://localhost:44157 "DELETE /session/804708fe4264d7931c62dddf894bbe46 HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
Crawling config
Crawling issue: nbHits 0 for Developer_Docs_blog
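The run above ends with nbHits 0 even though the modified config lists start_urls. One cheap check to rule out before digging into the crawler is whether every entry parses as an absolute http(s) URL, since a URL that picked up stray characters while being copied would not. This is a minimal sketch using only the standard library; the function name `invalid_start_urls` is hypothetical, and the handling of object-style entries with a `url` key is an assumption about the config format.

```python
import json
from urllib.parse import urlsplit


def invalid_start_urls(config_text):
    """Return the start_urls entries that are not absolute http(s) URLs."""
    config = json.loads(config_text)
    bad = []
    for entry in config.get("start_urls", []):
        # Assumption: an entry is either a plain string or an object
        # carrying the address under a "url" key.
        url = entry.get("url", "") if isinstance(entry, dict) else entry
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(url)
    return bad
```

An empty result only means the URLs are well-formed; it does not show that the pages were fetched or that the selectors matched anything.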
Rubai Mandal
12/24/2024, 7:11 PM
run_update_config
Rubai Mandal
12/25/2024, 7:29 PM
Rubai Mandal
12/25/2024, 8:05 PM
DEBUG:typesense.api_call:Making delete /collections/Developer_Docs_blog_1735156203
DEBUG:typesense.api_call:Try 1 to node typesense.jp-internal.svc.cluster.local:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): typesense.jp-internal.svc.cluster.local:8108
DEBUG:urllib3.connectionpool:http://typesense.jp-internal.svc.cluster.local:8108 "DELETE /collections/Developer_Docs_blog_1735156203 HTTP/1.1" 404 None
DEBUG:typesense.api_call:typesense.jp-internal.svc.cluster.local:8108 is healthy. Status code: 404
DEBUG:typesense.api_call:Making post /collections
DEBUG:typesense.api_call:Try 1 to node typesense.jp-internal.svc.cluster.local:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): typesense.jp-internal.svc.cluster.local:8108
DEBUG:urllib3.connectionpool:http://typesense.jp-internal.svc.cluster.local:8108 "POST /collections HTTP/1.1" 201 None
DEBUG:typesense.api_call:typesense.jp-internal.svc.cluster.local:8108 is healthy. Status code: 201
INFO:scrapy.addons:Enabled addons:
Rubai Mandal
12/26/2024, 2:48 PM
>>>>Crawler started. 7306.301826105
>>>>Crawler completed. 7306.303974783
>>>>Execution time: 0.00 seconds
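The two readings above differ by roughly 2 ms, which may suggest that the awaited call returned before any page was fetched. As a rough sketch of how such timing could be wrapped around the awaited crawl so that a too-short run is flagged explicitly, the helper below uses only `asyncio` and `time.monotonic`. The names `timed` and `fake_crawl` and the threshold value are hypothetical; `fake_crawl` is a stand-in for the real crawl, not the scraper's code.

```python
import asyncio
import time


async def timed(awaitable, min_expected_seconds=1.0):
    """Await `awaitable` and return (result, elapsed_seconds, suspicious).

    `suspicious` is True when the call finished faster than the caller
    considers plausible for real work.
    """
    started = time.monotonic()
    result = await awaitable
    elapsed = time.monotonic() - started
    return result, elapsed, elapsed < min_expected_seconds


async def fake_crawl(delay):
    # Hypothetical stand-in for the real crawl; it only sleeps.
    await asyncio.sleep(delay)
    return "done"
```

For example, `asyncio.run(timed(fake_crawl(0.0), min_expected_seconds=0.1))` reports a suspicious run, while a `fake_crawl(0.2)` call does not.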
@Jason Bosco @Kishore Nallan could you please help me figure out what the issue might be? The same code was working perfectly before, but this issue started happening after I changed the driver code:
driver = webdriver.Chrome(service=Service(), options=chrome_options)
With the same driver we are able to fetch the sitemap URLs, but we can't scrape the data. We are also not getting any error logs.
Kindly have a look.
Jason Bosco
12/26/2024, 4:31 PM
Rubai Mandal
12/26/2024, 4:54 PM
Rubai Mandal
12/27/2024, 1:17 PM
Jason Bosco
12/27/2024, 9:45 PM
Rubai Mandal
12/28/2024, 5:24 AM