# community-help
r
Hi, I am using the documentation search scraper and I am unable to scrape the data; I am getting Nb hits: 0. Can someone please help me out? This is my log:
```
2024-12-24 14:42:11 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scraper.src.custom_dupefilter.CustomDupeFilter',
 'LOG_ENABLED': '1',
 'LOG_LEVEL': 'INFO',
 'TELNETCONSOLE_ENABLED': False,
 'USER_AGENT': 'Custom Bot'}
2024-12-24 14:42:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scraper.src.custom_downloader_middleware.CustomDownloaderMiddleware']
2024-12-24 14:42:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-12-24 14:42:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-12-24 14:42:11 [scrapy.core.engine] INFO: Spider opened
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:33301/session/d7ac1c7caa02db9188604a2681570d27/window {}
DEBUG:urllib3.connectionpool:http://localhost:33301 "DELETE /session/d7ac1c7caa02db9188604a2681570d27/window HTTP/1.1" 200 12
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:33301/session/d7ac1c7caa02db9188604a2681570d27 {}
DEBUG:urllib3.connectionpool:http://localhost:33301 "DELETE /session/d7ac1c7caa02db9188604a2681570d27 HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
Getting sitemap links
Crawling config
Nb hits: 0
```
Attaching my code:
```python
settings = {
    'LOG_ENABLED': '1',
    'LOG_LEVEL': 'INFO',
    'USER_AGENT': config.user_agent,
    'DOWNLOADER_MIDDLEWARES': {DOWNLOADER_MIDDLEWARES_PATH: 900},
    # Needs to be > 600 so it runs after the RedirectMiddleware
    'DUPEFILTER_USE_ANCHORS': config.use_anchors,
    # Use our custom dupefilter so we are scheme-agnostic about the links provided
    'DUPEFILTER_CLASS': DUPEFILTER_CLASS_PATH,
    'DEFAULT_REQUEST_HEADERS': DEFAULT_REQUEST_HEADERS,
    'TELNETCONSOLE_ENABLED': False
}

print("Crawling config")

crawler = CrawlerRunner(settings)
crawler.crawl(DocumentationSpider,
              config=config,
              typesense_helper=typesense_helper,
              strategy=strategy)

await crawler.join().asFuture(loop=asyncio.get_event_loop())

# Kill the browser if needed
BrowserHandler.destroy(config.driver)

if len(config.extra_records) > 0:
    typesense_helper.add_records(
        config.extra_records, "Extra records", False)
print("")

if DocumentationSpider.NB_INDEXED > 0:
    typesense_helper.commit_tmp_collection()
    print('Nb hits: {}'.format(DocumentationSpider.NB_INDEXED))
    config.update_nb_hits_value(DocumentationSpider.NB_INDEXED)
else:
    print('Crawling issue: nbHits 0 for ' + config.index_name)
    exit(EXIT_CODE_NO_RECORD)
print("")
```
j
There's not much information here to figure out what the issue is
I'd recommend tracing through the scraper code along with your config file in a debugger to see what's happening
Most likely the issue is that the config file doesn't list all the base URLs
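For reference, a minimal sketch of that kind of tracing, using plain Scrapy rather than the docsearch-scraper's own classes (the spider name and start URL here are placeholders): break inside the parse callback so you can see whether any page is ever parsed at all.

```python
# Minimal probe spider: run with `scrapy runspider probe.py`.
# If the breakpoint never fires, requests are being dropped before parsing.
import scrapy

class ProbeSpider(scrapy.Spider):
    name = "probe"
    start_urls = ["https://example.com"]  # substitute one of your start_urls

    def parse(self, response):
        breakpoint()  # pauses under pdb; inspect response.url, response.status
        self.logger.info("Parsed %s (%d bytes)", response.url, len(response.body))
```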
r
But I checked the log and I can see that we get all the links inside `start_urls` of the config. Can it be a crawler issue?
```python
crawler = CrawlerRunner(settings)
crawler.crawl(DocumentationSpider,
              config=config,
              typesense_helper=typesense_helper,
              strategy=strategy)

await crawler.join().asFuture(loop=asyncio.get_event_loop())
```
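One way to check whether the crawler itself is the problem is to confirm that it actually fetches pages. A sketch using Scrapy's standard signals API (`create_crawler` and `signals.response_received` are plain Scrapy; the spider arguments are the ones from this thread):

```python
# Sketch: log every fetched response so a silent zero-page crawl is visible.
from scrapy import signals
from scrapy.crawler import CrawlerRunner

def on_response(response, request, spider):
    print("Fetched:", response.status, response.url)

runner = CrawlerRunner(settings)
crawler = runner.create_crawler(DocumentationSpider)
crawler.signals.connect(on_response, signal=signals.response_received)
runner.crawl(crawler, config=config,
             typesense_helper=typesense_helper, strategy=strategy)
```

If nothing prints, the requests are being dropped before download, for example by the dupefilter or the custom downloader middleware.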
Complete log:
```
Modified config {"index_name": "Developer_Docs_blog", "js_render": true, "use_anchors": false, "user_agent": "Custom Bot", "start_urls": ["https://juspay.io/in/docs/blog/magento/overview/pre-requisites", "https://juspay.io/in/docs/blog/magento/base-sdk-integration/installation", "https://juspay.io/in/docs/blog/magento/base-sdk-integration/generating-rsa-key-pair", "https://juspay.io/in/docs/blog/magento/base-sdk-integration/magento-backend-setup", "https://juspay.io/in/docs/blog/magento/resources/useful-links", "https://juspay.io/in/docs/blog/magento/resources/transaction-status", "https://juspay.io/in/docs/blog/magento/resources/refunds", "https://juspay.io/in/docs/blog/woocommerce/overview/pre-requisites", "https://juspay.io/in/docs/blog/woocommerce/base-sdk-integration/installation", "https://juspay.io/in/docs/blog/woocommerce/base-sdk-integration/generating-rsa-key-pair", "https://juspay.io/in/docs/blog/woocommerce/base-sdk-integration/woocommerce-dashboard-setup", "https://juspay.io/in/docs/blog/woocommerce/resources/useful-links", "https://juspay.io/in/docs/blog/woocommerce/resources/transaction-status", "https://juspay.io/in/docs/blog/woocommerce/resources/refunds"], "selectors": {"lvl0": "[data-search-class='lvl0']", "lvl1": "[data-search-class='lvl1']", "lvl2": "[data-search-class='lvl2'],[data-search-class='lvl3']", "text": "[data-search-class='text']"}, "strip_chars": " .,;:#", "scrap_start_urls": true, "custom_settings": {}}
2024-12-25 00:36:35 [scrapy.addons] INFO: Enabled addons:
[]
2024-12-25 00:36:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2024-12-25 00:36:35 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scraper.src.custom_dupefilter.CustomDupeFilter',
 'LOG_ENABLED': '1',
 'LOG_LEVEL': 'INFO',
 'TELNETCONSOLE_ENABLED': False,
 'USER_AGENT': 'Custom Bot'}
2024-12-25 00:36:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scraper.src.custom_downloader_middleware.CustomDownloaderMiddleware']
2024-12-25 00:36:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-12-25 00:36:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-12-25 00:36:35 [scrapy.core.engine] INFO: Spider opened
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:44157/session/804708fe4264d7931c62dddf894bbe46/window {}
DEBUG:urllib3.connectionpool:http://localhost:44157 "DELETE /session/804708fe4264d7931c62dddf894bbe46/window HTTP/1.1" 200 12
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:44157/session/804708fe4264d7931c62dddf894bbe46 {}
DEBUG:urllib3.connectionpool:http://localhost:44157 "DELETE /session/804708fe4264d7931c62dddf894bbe46 HTTP/1.1" 200 14
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
Crawling config
Crawling issue: nbHits 0 for Developer_Docs_blog
```
This is from my index.py file, the entry point of my scraper: `run_update_config`.
@Jason Bosco can you please have a quick look? I've been stuck here for more than a week.
Also, these are the logs of the Typesense status:
```
DEBUG:typesense.api_call:Making delete /collections/Developer_Docs_blog_1735156203
DEBUG:typesense.api_call:Try 1 to node typesense.jp-internal.svc.cluster.local:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): typesense.jp-internal.svc.cluster.local:8108
DEBUG:urllib3.connectionpool:http://typesense.jp-internal.svc.cluster.local:8108 "DELETE /collections/Developer_Docs_blog_1735156203 HTTP/1.1" 404 None
DEBUG:typesense.api_call:typesense.jp-internal.svc.cluster.local:8108 is healthy. Status code: 404
DEBUG:typesense.api_call:Making post /collections
DEBUG:typesense.api_call:Try 1 to node typesense.jp-internal.svc.cluster.local:8108 -- healthy? True
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): typesense.jp-internal.svc.cluster.local:8108
DEBUG:urllib3.connectionpool:http://typesense.jp-internal.svc.cluster.local:8108 "POST /collections HTTP/1.1" 201 None
DEBUG:typesense.api_call:typesense.jp-internal.svc.cluster.local:8108 is healthy. Status code: 201
INFO:scrapy.addons:Enabled addons:
```
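Those status codes look healthy by themselves: the 404 on DELETE just means there was no leftover temporary collection, and the 201 on POST means the temporary collection was created. A sketch for checking the collection directly with the typesense Python client (the host is taken from the logs above; the API key is a placeholder):

```python
# Sketch: inspect the temporary collection to see how many documents landed.
import typesense

client = typesense.Client({
    'nodes': [{'host': 'typesense.jp-internal.svc.cluster.local',
               'port': 8108, 'protocol': 'http'}],
    'api_key': 'YOUR_API_KEY',  # placeholder
    'connection_timeout_seconds': 5,
})

schema = client.collections['Developer_Docs_blog_1735156203'].retrieve()
print(schema['num_documents'])  # 0 here would confirm nothing was indexed
```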
The log of the crawler execution time:
```
>>>>Crawler started. 7306.301826105
>>>>Crawler completed. 7306.303974783
>>>>Execution time: 0.00 seconds
```
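An execution time of 0.00 seconds means `crawler.join()` resolved immediately, i.e. the crawl finished without scheduling anything. One possible cause with `CrawlerRunner` plus `await` (a guess from these logs, not a confirmed diagnosis) is the Twisted reactor not being the asyncio one; Scrapy ships a helper to install it before the runner is created:

```python
# Sketch: ensure Twisted runs on the asyncio reactor so that
# Deferred.asFuture()/await interop behaves as expected.
from scrapy.utils.reactor import install_reactor

install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
```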
@Jason Bosco @Kishore Nallan can you please help me figure out what the issue could be? The same code was working perfectly before, but after changing the driver code this issue is happening:
```python
driver = webdriver.Chrome(service=Service(), options=chrome_options)
```
With the same driver we are able to fetch the sitemap URLs, but we can't scrape the data. We also don't get any error log. Kindly please have a look.
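For context, a self-contained version of that Selenium 4 construction (the `chrome_options` arguments here are assumptions, not taken from the thread; with an empty `Service()`, Selenium Manager resolves the chromedriver binary automatically):

```python
# Sketch: minimal Selenium 4 Chrome setup equivalent to the line above.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless=new')  # assumed; typical for crawlers
chrome_options.add_argument('--no-sandbox')    # assumed; common in containers

driver = webdriver.Chrome(service=Service(), options=chrome_options)
driver.get('https://juspay.io/in/docs/blog/magento/overview/pre-requisites')
print(len(driver.page_source))  # sanity check that pages actually render
driver.quit()
```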
j
We are unfortunately unable to provide this level of involved free support on the crawler, especially considering that you've forked and customized it, and we've already spent several hours earlier this year or last year, IIRC, trying to debug variations of these issues with you. The crawler code is fully open source, so you'll want to run it through a debugger and step through the code to identify what's going on.
r
Sure, and thanks @Jason Bosco.
Hey @Jason Bosco, I have identified the issue. It occurred due to a version update. This problem is also present in the current GitHub repository of the Typesense DocSearch project.
j
Found the issue and pushed out a fix for it. Could you pull the latest version of the code from the Typesense DocSearch GitHub repo and try again?
r
Cool, I have already fixed the issue in my repo.