Error with Scrapy `run_config` Function when Running Twice
TLDR Rubai encountered a "ReactorNotRestartable" error when running the run_config function from index.py a second time. They shared their code, but Jason advised posting on Stack Overflow, as the issue isn't Typesense-specific.
Mar 30, 2023 (6 months ago)
Rubai
10:57 AM
When we run the run_config function from index.py, it works fine the first time, but the second time we run it we get the following error:

  File "/Users/rubai.mandal/docsearch-scrapper/scraper/src/index.py", line 95, in run_config
    process.start()
  File "/Users/rubai.mandal/Library/Python/3.9/lib/python/site-packages/scrapy/crawler.py", line 383, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/rubai.mandal/Library/Python/3.9/lib/python/site-packages/twisted/internet/base.py", line 1317, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Users/rubai.mandal/Library/Python/3.9/lib/python/site-packages/twisted/internet/base.py", line 1299, in startRunning
    ReactorBase.startRunning(cast(ReactorBase, self))
  File "/Users/rubai.mandal/Library/Python/3.9/lib/python/site-packages/twisted/internet/base.py", line 843, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Can anyone help me figure out why this happens? I am attaching the run_config function in the thread.

Rubai
10:58 AM
def run_config(config):
    config = ConfigLoader(config)
    CustomDownloaderMiddleware.driver = config.driver
    DocumentationSpider.NB_INDEXED = 0
    strategy = DefaultStrategy(config)
    typesense_helper = TypesenseHelper(
        config.index_name,
        config.index_name_tmp
    )
    typesense_helper.create_tmp_collection()

    root_module = 'src.' if __name__ == '__main__' else 'scraper.src.'
    DOWNLOADER_MIDDLEWARES_PATH = root_module + 'custom_downloader_middleware.' + CustomDownloaderMiddleware.__name__
    DUPEFILTER_CLASS_PATH = root_module + 'custom_dupefilter.' + CustomDupeFilter.__name__

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en",
    }  # Defaults for scrapy

    if os.getenv("CF_ACCESS_CLIENT_ID") and os.getenv("CF_ACCESS_CLIENT_SECRET"):
        headers.update(
            {
                "CF-Access-Client-Id": os.getenv("CF_ACCESS_CLIENT_ID"),
                "CF-Access-Client-Secret": os.getenv("CF_ACCESS_CLIENT_SECRET"),
            }
        )
    elif os.getenv("IAP_AUTH_CLIENT_ID") and os.getenv("IAP_AUTH_SERVICE_ACCOUNT_JSON"):
        iap_token = IAPAuth(
            client_id=os.getenv("IAP_AUTH_CLIENT_ID"),
            service_account_secret_dict=json.loads(
                os.getenv("IAP_AUTH_SERVICE_ACCOUNT_JSON")
            ),
        )(requests.Request()).headers["Authorization"]
        headers.update({"Authorization": iap_token})

    DEFAULT_REQUEST_HEADERS = headers

    settings = {
        'LOG_ENABLED': '1',
        'LOG_LEVEL': 'ERROR',
        'USER_AGENT': config.user_agent,
        'DOWNLOADER_MIDDLEWARES': {DOWNLOADER_MIDDLEWARES_PATH: 900},
        # Need to be > 600 to be after the redirectMiddleware
        'DUPEFILTER_USE_ANCHORS': config.use_anchors,
        # Use our custom dupefilter in order to be scheme agnostic regarding link provided
        'DUPEFILTER_CLASS': DUPEFILTER_CLASS_PATH,
        'DEFAULT_REQUEST_HEADERS': DEFAULT_REQUEST_HEADERS,
        'TELNETCONSOLE_ENABLED': False
    }

    process = CrawlerProcess(settings)
    process.crawl(
        DocumentationSpider,
        config=config,
        typesense_helper=typesense_helper,
        strategy=strategy
    )
    process.start()
    process.stop()

    # Kill browser if needed
    BrowserHandler.destroy(config.driver)

    if len(config.extra_records) > 0:
        typesense_helper.add_records(config.extra_records, "Extra records", False)

    print("")
    if DocumentationSpider.NB_INDEXED > 0:
        typesense_helper.commit_tmp_collection()
        print('Nb hits: {}'.format(DocumentationSpider.NB_INDEXED))
        config.update_nb_hits_value(DocumentationSpider.NB_INDEXED)
    else:
        print('Crawling issue: nbHits 0 for ' + config.index_name)
        exit(EXIT_CODE_NO_RECORD)
    print("")
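As background (editor's note, not from the thread): process.start() runs the Twisted reactor, and Twisted's reactor can be started at most once per Python process, so a second call to run_config in the same interpreter raises ReactorNotRestartable. A common workaround is to launch each crawl in a fresh child process. Below is a minimal sketch of that pattern; crawl and run_in_subprocess are hypothetical names, with crawl standing in for run_config:

```python
import multiprocessing

def crawl(config_path):
    # Stand-in for run_config(config): in the real scraper this would
    # build a CrawlerProcess and call process.start(), which runs the
    # Twisted reactor until the crawl finishes.
    print(f"crawling with {config_path}")

def run_in_subprocess(config_path):
    # Run one crawl in a fresh child process. The reactor is started
    # at most once inside that child, so repeated calls never try to
    # restart an already-stopped reactor in the parent process.
    p = multiprocessing.Process(target=crawl, args=(config_path,))
    p.start()
    p.join()
    return p.exitcode

if __name__ == "__main__":
    run_in_subprocess("config-a.json")  # first crawl
    run_in_subprocess("config-b.json")  # second crawl in the same parent
```

Each child exits when its crawl completes, so the parent can loop over as many configs as it likes without ever touching a stopped reactor.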
Apr 03, 2023 (6 months ago)
Rubai
09:45 AM

Jason
02:26 PM