Kevin Donovan
04/22/2024, 1:53 PM
docker run -it --env-file=./Localhost-Environment.env -e "CONFIG=$(cat ./Localhost-Config.json | jq -r tostring)" typesense/docsearch-scraper
The content of the Localhost-Environment.env file is as follows:
TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=host.docker.internal
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
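As a sanity check (this assumes Docker Desktop, where host.docker.internal resolves automatically inside containers; on plain Linux Docker you would need --add-host=host.docker.internal:host-gateway), the Typesense server can be probed from a throwaway container, since the /health endpoint needs no API key:
# Hypothetical reachability check; a healthy server answers {"ok":true}
docker run --rm curlimages/curl -s http://host.docker.internal:8108/health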
The content of the Localhost-Config.json file is as follows:
{
  "index_name": "Tutorial",
  "start_urls": [
    {
      "url": "http://host.docker.internal"
    }
  ],
  "js_render": true,
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "lvl3": "h4",
    "lvl4": "h5",
    "lvl5": "h6",
    "text": "p, li"
  },
  "strip_chars": " .,;:#"
}
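To confirm the CONFIG substitution is passing a valid, single-line JSON string, the same jq expression from the docker run command can be run on its own:
# Prints the config as one compact JSON string, exactly what lands in $CONFIG
jq -r tostring ./Localhost-Config.json
# Or just check that the file parses at all
jq empty ./Localhost-Config.json && echo "config parses as valid JSON"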
When the Docker container is run, the crawl finishes with 0 nbHits, as shown below:
INFO:scrapy.utils.log:Scrapy 2.9.0 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'src.custom_dupefilter.CustomDupeFilter',
'LOG_ENABLED': '1',
'LOG_LEVEL': 'ERROR',
'TELNETCONSOLE_ENABLED': False,
'USER_AGENT': 'Typesense DocSearch Scraper (Bot; '
'https://typesense.org/docs/guide/docsearch.html)'}
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:89: ScrapyDeprecationWarning: RFPDupeFilter subclasses must either modify their overridden '__init__' method and 'from_settings' class method to support a 'fingerprinter' parameter, or reimplement the 'from_crawler' class method.
warn(
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:53: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
self.fingerprinter = fingerprinter or RequestFingerprinter()
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Getting http://host.docker.internal from selenium
DEBUG:selenium.webdriver.remote.remote_connection:POST http://localhost:35887/session/09a09fcba0f26c739b06d0ae1bd831d7/url {"url": "http://host.docker.internal"}
DEBUG:urllib3.connectionpool:http://localhost:35887 "POST /session/09a09fcba0f26c739b06d0ae1bd831d7/url HTTP/1.1" 200 0
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:GET http://localhost:35887/session/09a09fcba0f26c739b06d0ae1bd831d7/source {}
DEBUG:urllib3.connectionpool:http://localhost:35887 "GET /session/09a09fcba0f26c739b06d0ae1bd831d7/source HTTP/1.1" 200 0
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":"\u003Chtml lang=\"en\" data-theme=\"light\" dir=\"ltr\" data-rh=\"lang,dir\">\u003Chead>\n \u003Cmeta charset=\"utf-8\">\n \u003Cmeta name=\"generator\" content=\"Docusaurus\">\n \u003Ctitle>My Site\u003C/title>\n \u003Clink rel=\"alternate\" type=\"application/rss+xml\" href=\"/blog/rss.xml\" title=\"My Site RSS Feed\">\n\u003Clink rel=\"alternate\" type=\"application/atom+xml\" href=\"/blog/atom.xml\" title=\"My Site Atom Feed\">\n \u003Cscript defer=\"\" src=\"/runtime~main.js\">\u003C/script>\u003Cscript defer=\"\" src=\"/main.js\">\u003C/script>\u003Clink href=\"/styles.css\" rel=\"stylesheet\">\n \u003Clink rel=\"icon\" href=\"/img/favicon.ico\" data-rh=\"true\">\u003Clink rel=\"canonical\" href=\"<https://your-docusaurus-site.example.com/>\" data-rh=\"true\">\u003Clink rel=\"alternate\" href=\"<https://your-docusaurus-site.example.com/>\" hreflang=\"en\" data-rh=\"true\">\u003Clink rel=\"alternate\" href=\"<https://your-docusaurus-site.example.com/>\" hreflang=\"x-default\" data-rh=\"true\">\u003Cmeta property=\"og:title\" content=\"My Site\" data-rh=\"true\">\u003Cmeta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" data-rh=\"true\">\u003Cmeta name=\"twitter:card\" content=\"summary_large_image\" data-rh=\"true\">\u003Cmeta property=\"og:image\" content=\"<https://your-docusaurus-site.example.com/img/docusaurus-social-card.jpg>\" data-rh=\"true\">\u003Cmeta name=\"twitter:image\" content=\"<https://your-docusaurus-site.example.com/img/docusaurus-social-card.jpg>\" data-rh=\"true\">\u003Cmeta property=\"og:url\" content=\"<https://your-docusaurus-site.example.com/>\" data-rh=\"true\">\u003Cmeta property=\"og:locale\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docusaurus_locale\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docusaurus_tag\" content=\"default\" data-rh=\"true\">\u003Cmeta name=\"docsearch:language\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docsearch:docusaurus_tag\" content=\"default\" data-rh=\"true\">\u003C/head>\n \u003Cbody class=\"navigation-with-keyboard\" data-rh=\"class\">\n \u003Cscript>\n(function() {\n var defaultMode = 'light';\n var respectPrefersColorScheme = false;\n\n function setDataThemeAttribute(theme) {\n document.documentElement.setAttribute('data-theme', theme);\n }\n\n function getQueryStringTheme() {\n try {\n return new URLSearchParams(window.location.search).get('docusaurus-theme')\n } catch(e) {}\n }\n\n function getStoredTheme() {\n try {\n return localStorage.getItem('theme');\n } catch (err) {}\n }\n\n var initialTheme = getQueryStringTheme() || getStoredTheme();\n if (initialTheme !== null) {\n setDataThemeAttribute(initialTheme);\n } else {\n if (\n respectPrefersColorScheme &&\n window.matchMedia('(prefers-color-scheme: dark)').matches\n ) {\n setDataThemeAttribute('dark');\n } else if (\n respectPrefersColorScheme &&\n window.matchMedia('(prefers-color-scheme: light)').matches\n ) {\n setDataThemeAttribute('light');\n } else {\n setDataThemeAttribute(defaultMode === 'dark' ? 
'dark' : 'light');\n }\n }\n})();\n\n(function() {\n try {\n const entries = new URLSearchParams(window.location.search).entries();\n for (var [searchKey, value] of entries) {\n if (searchKey.startsWith('docusaurus-data-')) {\n var key = searchKey.replace('docusaurus-data-',\"data-\")\n document.documentElement.setAttribute(key, value);\n }\n }\n } catch(e) {}\n})();\n\n\n \u003C/script>\n \u003Cdiv id=\"__docusaurus\">\u003C/div>\n \n \n \n\n\u003C/body>\u003C/html>"} | headers=HTTPHeaderDict({'Content-Length': '3665', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:GET http://localhost:35887/session/09a09fcba0f26c739b06d0ae1bd831d7/url {}
DEBUG:urllib3.connectionpool:http://localhost:35887 "GET /session/09a09fcba0f26c739b06d0ae1bd831d7/url HTTP/1.1" 200 0
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":"http://host.docker.internal/"} | headers=HTTPHeaderDict({'Content-Length': '40', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:scrapy.core.engine:Crawled (200) <GET http://host.docker.internal> (referer: None)
> DocSearch: http://host.docker.internal/ 0 records)
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/request_bytes': 268,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 3279,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.855292,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 4, 22, 13, 48, 36, 258071),
'memusage/max': 68087808,
'memusage/startup': 68087808,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2024, 4, 22, 13, 48, 35, 402779)}
INFO:scrapy.core.engine:Spider closed (finished)
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:35887/session/09a09fcba0f26c739b06d0ae1bd831d7 {}
DEBUG:urllib3.connectionpool:http://localhost:35887 "DELETE /session/09a09fcba0f26c739b06d0ae1bd831d7 HTTP/1.1" 200 0
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
Crawling issue: nbHits 0 for Tutorial
Please forgive the long post, but I am at wits' end.
Jason Bosco
04/22/2024, 7:50 PM
Kevin Donovan
04/23/2024, 7:17 AM
Jason Bosco
04/23/2024, 4:16 PM
172.x.x.x and so it's not able to connect to 127.0.0.1... Just a guess. I'm not familiar with how Docker networking on Windows + WSL is set up.
Might be best to ask on Stack Overflow
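One way to test that guess from inside a container (hypothetical commands; busybox ships both tools, and on Docker Desktop the host.docker.internal name is injected automatically):
# What does host.docker.internal resolve to, and does anything answer on 8108?
docker run --rm busybox nslookup host.docker.internal
docker run --rm busybox wget -qO- http://host.docker.internal:8108/health
If the name resolves but the second call fails, the service on the host is likely bound only to 127.0.0.1; binding it to 0.0.0.0 (for Typesense, via its api-address setting) would be the usual fix.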
Kevin Donovan
04/23/2024, 4:27 PM
Create collection request body is malformed.
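For reference, a minimal create-collection request sent by hand can help separate a server/API-key problem from a scraper problem (the collection name and field here are placeholders, not what the scraper actually sends):
curl -X POST http://localhost:8108/collections \
  -H "X-TYPESENSE-API-KEY: xyz" \
  -H "Content-Type: application/json" \
  -d '{"name": "Tutorial_test", "fields": [{"name": "title", "type": "string"}]}'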
Kevin Donovan
04/23/2024, 4:39 PM
Kevin Donovan
04/24/2024, 9:40 AM
Kevin Donovan
04/24/2024, 9:41 AM
Kevin Donovan
04/25/2024, 7:27 AM