# community-help
k
Hi! I am running the Typesense server, Docusaurus, and the Docker typesense/docsearch-scraper image, all from Ubuntu WSL. Docusaurus is running on port 80, as it should be for scraping. When I run the scraper, the following command is invoked:
```bash
docker run -it --env-file=./Localhost-Environment.env -e "CONFIG=$(cat ./Localhost-Config.json | jq -r tostring)" typesense/docsearch-scraper
```
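On Docker Desktop, host.docker.internal resolves out of the box; on a plain Linux Docker engine (common under WSL without Docker Desktop) the name does not exist unless it is mapped explicitly. A variant of the same command that adds the mapping, assuming Docker Engine 20.10 or newer:

```bash
# --add-host maps host.docker.internal to the host's gateway IP.
# Requires Docker Engine 20.10+; harmless on Docker Desktop, which
# already provides this name.
docker run -it \
  --add-host=host.docker.internal:host-gateway \
  --env-file=./Localhost-Environment.env \
  -e "CONFIG=$(cat ./Localhost-Config.json | jq -r tostring)" \
  typesense/docsearch-scraper
```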
The content of the `Localhost-Environment.env` file is as follows:
```
TYPESENSE_API_KEY=xyz
TYPESENSE_HOST=host.docker.internal
TYPESENSE_PORT=8108
TYPESENSE_PROTOCOL=http
```
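Before scraping, it is worth confirming that the Typesense endpoint these settings point at actually answers. A quick check from the host, assuming the server is on port 8108 and xyz is the API key from the env file above:

```bash
# Health check: the server should answer {"ok":true}
curl http://localhost:8108/health

# Verify the API key is accepted by listing existing collections
curl -H "X-TYPESENSE-API-KEY: xyz" http://localhost:8108/collections
```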
The content of the `Localhost-Config.json` file is as follows:
```json
{
  "index_name": "Tutorial",
  "start_urls": [
    {
      "url": "http://host.docker.internal"
    }
  ],
  "js_render": true,
  "selectors": {
    "lvl0": "h1",
    "lvl1": "h2",
    "lvl2": "h3",
    "lvl3": "h4",
    "lvl4": "h5",
    "lvl5": "h6",
    "text": "p, li"
  },
  "strip_chars": " .,;:#"
}
```
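Note that `"js_render": true` matters here: Docusaurus renders its content client-side, so the raw HTML the server sends is an almost empty shell that only gains headings and paragraphs once JavaScript runs. A quick way to see what a non-JS client receives (a sketch; the grep patterns are only illustrative):

```bash
# The static HTML contains the empty React mount point but almost no text;
# the actual content is injected by JavaScript at load time.
curl -s http://localhost:80 | grep -o '__docusaurus'
curl -s http://localhost:80 | grep -c '<p>'   # typically 0 for the raw shell
```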
When the scraper container runs, it reports `0 nbHits`, as shown below:
```
INFO:scrapy.utils.log:Scrapy 2.9.0 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.9.2.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0], pyOpenSSL 23.1.1 (OpenSSL 3.1.0 14 Mar 2023), cryptography 40.0.2, Platform Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
INFO:scrapy.crawler:Overridden settings:
{'DUPEFILTER_CLASS': 'src.custom_dupefilter.CustomDupeFilter',
 'LOG_ENABLED': '1',
 'LOG_LEVEL': 'ERROR',
 'TELNETCONSOLE_ENABLED': False,
 'USER_AGENT': 'Typesense DocSearch Scraper (Bot; '
               'https://typesense.org/docs/guide/docsearch.html)'}
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/utils/request.py:232: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor
INFO:scrapy.middleware:Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
INFO:scrapy.middleware:Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'src.custom_downloader_middleware.CustomDownloaderMiddleware']
INFO:scrapy.middleware:Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
INFO:scrapy.middleware:Enabled item pipelines:
[]
INFO:scrapy.core.engine:Spider opened
WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:89: ScrapyDeprecationWarning: RFPDupeFilter subclasses must either modify their overridden '__init__' method and 'from_settings' class method to support a 'fingerprinter' parameter, or reimplement the 'from_crawler' class method.
  warn(

WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.10/site-packages/scrapy/dupefilters.py:53: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  self.fingerprinter = fingerprinter or RequestFingerprinter()

INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Getting http://host.docker.internal from selenium
DEBUG:selenium.webdriver.remote.remote_connection:POST http://localhost:35887/session/09a09fcba0f26c739b06d0ae1bd831d7/url {"url": "http://host.docker.internal"}
DEBUG:urllib3.connectionpool:http://localhost:35887 "POST /session/09a09fcba0f26c739b06d0ae1bd831d7/url HTTP/1.1" 200 0
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:GET http://localhost:35887/session/09a09fcba0f26c739b06d0ae1bd831d7/source {}
DEBUG:urllib3.connectionpool:http://localhost:35887 "GET /session/09a09fcba0f26c739b06d0ae1bd831d7/source HTTP/1.1" 200 0
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":"\u003Chtml lang=\"en\" data-theme=\"light\" dir=\"ltr\" data-rh=\"lang,dir\">\u003Chead>\n    \u003Cmeta charset=\"utf-8\">\n    \u003Cmeta name=\"generator\" content=\"Docusaurus\">\n    \u003Ctitle>My Site\u003C/title>\n    \u003Clink rel=\"alternate\" type=\"application/rss+xml\" href=\"/blog/rss.xml\" title=\"My Site RSS Feed\">\n\u003Clink rel=\"alternate\" type=\"application/atom+xml\" href=\"/blog/atom.xml\" title=\"My Site Atom Feed\">\n    \u003Cscript defer=\"\" src=\"/runtime~main.js\">\u003C/script>\u003Cscript defer=\"\" src=\"/main.js\">\u003C/script>\u003Clink href=\"/styles.css\" rel=\"stylesheet\">\n  \u003Clink rel=\"icon\" href=\"/img/favicon.ico\" data-rh=\"true\">\u003Clink rel=\"canonical\" href=\"https://your-docusaurus-site.example.com/\" data-rh=\"true\">\u003Clink rel=\"alternate\" href=\"https://your-docusaurus-site.example.com/\" hreflang=\"en\" data-rh=\"true\">\u003Clink rel=\"alternate\" href=\"https://your-docusaurus-site.example.com/\" hreflang=\"x-default\" data-rh=\"true\">\u003Cmeta property=\"og:title\" content=\"My Site\" data-rh=\"true\">\u003Cmeta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" data-rh=\"true\">\u003Cmeta name=\"twitter:card\" content=\"summary_large_image\" data-rh=\"true\">\u003Cmeta property=\"og:image\" content=\"https://your-docusaurus-site.example.com/img/docusaurus-social-card.jpg\" data-rh=\"true\">\u003Cmeta name=\"twitter:image\" content=\"https://your-docusaurus-site.example.com/img/docusaurus-social-card.jpg\" data-rh=\"true\">\u003Cmeta property=\"og:url\" content=\"https://your-docusaurus-site.example.com/\" data-rh=\"true\">\u003Cmeta property=\"og:locale\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docusaurus_locale\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docusaurus_tag\" content=\"default\" data-rh=\"true\">\u003Cmeta name=\"docsearch:language\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docsearch:docusaurus_tag\" content=\"default\" data-rh=\"true\">\u003C/head>\n  \u003Cbody class=\"navigation-with-keyboard\" data-rh=\"class\">\n    \u003Cscript>\n(function() {\n  var defaultMode = 'light';\n  var respectPrefersColorScheme = false;\n\n  function setDataThemeAttribute(theme) {\n    document.documentElement.setAttribute('data-theme', theme);\n  }\n\n  function getQueryStringTheme() {\n    try {\n      return new URLSearchParams(window.location.search).get('docusaurus-theme')\n    } catch(e) {}\n  }\n\n  function getStoredTheme() {\n    try {\n      return localStorage.getItem('theme');\n    } catch (err) {}\n  }\n\n  var initialTheme = getQueryStringTheme() || getStoredTheme();\n  if (initialTheme !== null) {\n    setDataThemeAttribute(initialTheme);\n  } else {\n    if (\n      respectPrefersColorScheme &&\n      window.matchMedia('(prefers-color-scheme: dark)').matches\n    ) {\n      setDataThemeAttribute('dark');\n    } else if (\n      respectPrefersColorScheme &&\n      window.matchMedia('(prefers-color-scheme: light)').matches\n    ) {\n      setDataThemeAttribute('light');\n    } else {\n      setDataThemeAttribute(defaultMode === 'dark' ? 'dark' : 'light');\n    }\n  }\n})();\n\n(function() {\n  try {\n    const entries = new URLSearchParams(window.location.search).entries();\n    for (var [searchKey, value] of entries) {\n      if (searchKey.startsWith('docusaurus-data-')) {\n        var key = searchKey.replace('docusaurus-data-',\"data-\")\n        document.documentElement.setAttribute(key, value);\n      }\n    }\n  } catch(e) {}\n})();\n\n\n            \u003C/script>\n    \u003Cdiv id=\"__docusaurus\">\u003C/div>\n    \n    \n  \n\n\u003C/body>\u003C/html>"} | headers=HTTPHeaderDict({'Content-Length': '3665', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:selenium.webdriver.remote.remote_connection:GET http://localhost:35887/session/09a09fcba0f26c739b06d0ae1bd831d7/url {}
DEBUG:urllib3.connectionpool:http://localhost:35887 "GET /session/09a09fcba0f26c739b06d0ae1bd831d7/url HTTP/1.1" 200 0
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":"http://host.docker.internal/"} | headers=HTTPHeaderDict({'Content-Length': '40', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
DEBUG:scrapy.core.engine:Crawled (200) <GET http://host.docker.internal> (referer: None)
> DocSearch: http://host.docker.internal/ 0 records)
INFO:scrapy.core.engine:Closing spider (finished)
INFO:scrapy.statscollectors:Dumping Scrapy stats:
{'downloader/request_bytes': 268,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 3279,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.855292,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 4, 22, 13, 48, 36, 258071),
 'memusage/max': 68087808,
 'memusage/startup': 68087808,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2024, 4, 22, 13, 48, 35, 402779)}
INFO:scrapy.core.engine:Spider closed (finished)
DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:35887/session/09a09fcba0f26c739b06d0ae1bd831d7 {}
DEBUG:urllib3.connectionpool:http://localhost:35887 "DELETE /session/09a09fcba0f26c739b06d0ae1bd831d7 HTTP/1.1" 200 0
DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
DEBUG:selenium.webdriver.remote.remote_connection:Finished Request

Crawling issue: nbHits 0 for Tutorial
```
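Worth noting: the page source Selenium returned above contains only the empty `<div id="__docusaurus"></div>` mount point and no rendered text, which is consistent with 0 records. A quick way to check from inside a throwaway container whether the site is reachable at all, and what HTML it serves (assuming a stock Alpine image; --add-host is only needed on a plain Linux engine):

```bash
# Does host.docker.internal resolve and respond from inside a container?
docker run --rm --add-host=host.docker.internal:host-gateway alpine \
  ping -c 1 host.docker.internal

# What HTML does a container actually receive from the site?
docker run --rm --add-host=host.docker.internal:host-gateway alpine \
  wget -qO- http://host.docker.internal | head -n 20
```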
Please forgive the long post, but I am at my wits' end.
j
What is the URL of the website you want to scrape? Is that running on your local machine?
k
Yes. The URL is 127.0.0.1:80. It is defined in the config file as http://host.docker.internal/. The same messages occur when the Typesense server, the Docker scraper, and the Docusaurus website are all running on Debian WSL or Ubuntu WSL.
j
Could you run the site locally on the same IP address that http://host.docker.internal/ is pointing to? I suspect that the Docker hostname is pointing to a subnet like 172.x.x.x and so it's not able to connect to 127.0.0.1... Just a guess. I'm not familiar with how Docker networking on Windows + WSL is set up. It might be best to ask on Stack Overflow.
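One way to test that guess is to bind the Docusaurus server to all interfaces instead of the loopback address, so that it is reachable from Docker's bridge network as well. A sketch, assuming a standard Docusaurus project:

```bash
# Build the site and serve it on all interfaces, not just 127.0.0.1,
# so containers on the 172.x.x.x bridge network can reach it
npm run build
npm run serve -- --host 0.0.0.0 --port 80
```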
k
The message that the Typesense server logs when an attempt to scrape fails is as follows:
`Create collection request body is malformed.`
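That error can be reproduced, or ruled out, independently of the scraper by creating a collection directly against the Typesense API. A minimal sketch, using the key from the env file above; the auto-schema body is only an example:

```bash
# If this succeeds, the server and API key are fine and the malformed
# request is coming from the scraper side
curl -X POST http://localhost:8108/collections \
  -H "X-TYPESENSE-API-KEY: xyz" \
  -H "Content-Type: application/json" \
  -d '{"name": "Tutorial", "fields": [{"name": ".*", "type": "auto"}]}'
```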
Thanks!
It is apparently a Docker issue and not a Typesense issue. Defining host.docker.internal as the value of the TYPESENSE_HOST variable passed to the scraper works in certain situations but not in others, or so it seems.
It worked for me a few months ago; now it does not.
The problem was with one single Docusaurus site: that of the Docusaurus Tutorial. When my other Docusaurus sites are built on Ubuntu WSL, with the Typesense server and the Docker scraper also running on Ubuntu WSL, the scraper 'sees' the sites and creates the collection correctly. I have reached out to Docusaurus to see if there are anti-scraping features built into the Docusaurus Tutorial. Thanks for your help!