#community-help

Error with Scrapy `run_config` Function When Running Twice

TLDR Rubai encountered a "ReactorNotRestartable" error when running the run_config function in index.py twice. They shared their code, but Jason advised posting on Stack Overflow since the issue isn't Typesense-specific.

Mar 30, 2023 (6 months ago)
Rubai
10:57 AM
Hi guys, when I run the run_config function from index.py, it works fine the first time, but the second time we run it we get this error:
  File "/Users/rubai.mandal/docsearch-scrapper/scraper/src/index.py", line 95, in run_config
    process.start()
  File "/Users/rubai.mandal/Library/Python/3.9/lib/python/site-packages/scrapy/crawler.py", line 383, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/rubai.mandal/Library/Python/3.9/lib/python/site-packages/twisted/internet/base.py", line 1317, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/rubai.mandal/Library/Python/3.9/lib/python/site-packages/twisted/internet/base.py", line 1299, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "/Users/rubai.mandal/Library/Python/3.9/lib/python/site-packages/twisted/internet/base.py", line 843, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Can anyone help me figure out why this happens?
I'm attaching the run_config function in the thread.
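For context, ReactorNotRestartable is general Twisted behavior rather than anything specific to this scraper: a Twisted reactor can be started at most once per Python process, so any code path that reaches reactor.run() a second time in the same interpreter fails exactly this way. A minimal sketch that reproduces the error with Twisted alone, no Scrapy involved:

from twisted.internet import reactor

# Schedule an immediate stop so the blocking run() call returns.
reactor.callWhenRunning(reactor.stop)
reactor.run()  # first start: fine

reactor.callWhenRunning(reactor.stop)
reactor.run()  # second start: raises ReactorNotRestartable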
Rubai
10:58 AM
def run_config(config):
    config = ConfigLoader(config)
    CustomDownloaderMiddleware.driver = config.driver
    DocumentationSpider.NB_INDEXED = 0

    strategy = DefaultStrategy(config)

    typesense_helper = TypesenseHelper(
        config.index_name,
        config.index_name_tmp
    )
    typesense_helper.create_tmp_collection()

    root_module = 'src.' if __name__ == '__main__' else 'scraper.src.'
    DOWNLOADER_MIDDLEWARES_PATH = root_module + 'custom_downloader_middleware.' + CustomDownloaderMiddleware.__name__
    DUPEFILTER_CLASS_PATH = root_module + 'custom_dupefilter.' + CustomDupeFilter.__name__

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en",
    }  # Defaults for Scrapy

    if os.getenv("CF_ACCESS_CLIENT_ID") and os.getenv("CF_ACCESS_CLIENT_SECRET"):
        headers.update(
            {
                "CF-Access-Client-Id": os.getenv("CF_ACCESS_CLIENT_ID"),
                "CF-Access-Client-Secret": os.getenv("CF_ACCESS_CLIENT_SECRET"),
            }
        )
    elif os.getenv("IAP_AUTH_CLIENT_ID") and os.getenv("IAP_AUTH_SERVICE_ACCOUNT_JSON"):
        iap_token = IAPAuth(
            client_id=os.getenv("IAP_AUTH_CLIENT_ID"),
            service_account_secret_dict=json.loads(
                os.getenv("IAP_AUTH_SERVICE_ACCOUNT_JSON")
            ),
        )(requests.Request()).headers["Authorization"]
        headers.update({"Authorization": iap_token})

    DEFAULT_REQUEST_HEADERS = headers
    settings = {
        'LOG_ENABLED': '1',
        'LOG_LEVEL': 'ERROR',
        'USER_AGENT': config.user_agent,
        'DOWNLOADER_MIDDLEWARES': {DOWNLOADER_MIDDLEWARES_PATH: 900},
        # Need to be > 600 to be after the redirectMiddleware
        'DUPEFILTER_USE_ANCHORS': config.use_anchors,
        # Use our custom dupefilter in order to be scheme agnostic regarding link provided
        'DUPEFILTER_CLASS': DUPEFILTER_CLASS_PATH,
        'DEFAULT_REQUEST_HEADERS': DEFAULT_REQUEST_HEADERS,
        'TELNETCONSOLE_ENABLED': False
    }
    process = CrawlerProcess(settings)

    process.crawl(
        DocumentationSpider,
        config=config,
        typesense_helper=typesense_helper,
        strategy=strategy
    )
    
    process.start()
    process.stop()

    # Kill browser if needed
    BrowserHandler.destroy(config.driver)

    if len(config.extra_records) > 0:
        typesense_helper.add_records(config.extra_records, "Extra records", False)

    print("")

    if DocumentationSpider.NB_INDEXED > 0:
        typesense_helper.commit_tmp_collection()
        print('Nb hits: {}'.format(DocumentationSpider.NB_INDEXED))
        config.update_nb_hits_value(DocumentationSpider.NB_INDEXED)
    else:
        print('Crawling issue: nbHits 0 for ' + config.index_name)
        exit(EXIT_CODE_NO_RECORD)
    print("")
Apr 03, 2023 (6 months ago)
Rubai
09:45 AM
Jason, anything about this?
Jason
02:26 PM
I’m not sure what’s happening here. But since it’s not Typesense-specific, I’d recommend posting on Stack Overflow.