#community-help

Troubleshooting Typesense Scraper Issues with Docusaurus

TLDR: Abdulrahman's Typesense scraper indexes only the start URL of a Docusaurus site. Jason suggested checking for client-side redirects, enabling JS rendering, and inspecting the HTML returned from the root path. The issue remains unresolved.

Oct 10, 2023 (1 month ago)
Abdulrahman
10:46 AM
I am having problems using the scraper (https://typesense.org/docs/guide/docsearch.html#step-1-set-up-docsearch-scraper) with Docusaurus. I edited the config to include the site URL and the sitemap.xml file, but only the page listed as the site URL is scraped and indexed. Does anyone have a clue how to fix this? By the way, the sitemap URL and the URLs inside it match the pattern of the start_urls (all start with http:// and have no www).
Jason
04:05 PM
I've seen this issue happen when there is a client-side redirect configured somehow
Jason
04:05 PM
Maybe try enabling JS rendering to see if that fixes the issue?
Oct 11, 2023 (1 month ago)
Abdulrahman
08:07 AM
Yeah, the way we have structured the Docusaurus website, there is no /docs path; the documentation is placed directly at the base URL, so that could be the issue. I have tried enabling JS rendering in the Typesense config file, but to no avail. How could I solve this issue? For now I had to write a script to get all the URLs from the sitemap and add them to the config JSON file, which is obviously not ideal. I appreciate the help.
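The workaround script mentioned above was not shared in the thread. A minimal sketch of what such a script might look like follows, assuming a standard sitemap (the sitemaps.org 0.9 namespace) and a config whose `start_urls` is a plain list of strings. The function names `urls_from_sitemap` and `with_start_urls` are hypothetical and are not part of the scraper.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files (sitemaps.org protocol 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> value found under <url> entries of a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def with_start_urls(config: dict, urls: list[str]) -> dict:
    """Return a copy of the config with the URLs appended to start_urls.

    Assumption: start_urls holds plain strings. Order is preserved and
    duplicates are dropped.
    """
    merged = dict(config)
    existing = list(config.get("start_urls", []))
    merged["start_urls"] = list(dict.fromkeys(existing + urls))
    return merged
```

The sitemap text could be fetched with any HTTP client and the result written back with `json.dump`; both steps are left out here to keep the sketch free of network and file access.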
Jason
02:33 PM
Could you do a curl to the root path and inspect the HTML returned? That could give you clues into what the scraper processes... and if there are no links in that document, then that's the source of the issue.
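One way to carry out this check is to list the `href` values of the anchor tags in the raw HTML, before any JavaScript runs. The sketch below uses only the Python standard library; `LinkCollector` and `extract_hrefs` are hypothetical helper names, and the HTML string is assumed to have been saved from the curl response.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html: str) -> list[str]:
    """Return all non-empty href values from <a> tags, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs
```

An empty result for the root page would suggest that the links are only added client-side, which matches the case Jason describes.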
Oct 16, 2023 (1 month ago)
Abdulrahman
11:56 AM
Apologies for the delayed response. I sent a GET request to the root path of the website using Postman and got the content, with links in the format "/foo/bar/". I don't know if that gives any clues. I also sent a GET request to the sitemap, and it returned the full links to all the pages. So I still do not know what is causing this issue. Any ideas?
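Root-relative links such as "/foo/bar/" resolve against the scheme and host of the page they appear on, so one further check is whether the resolved URLs still fall under the configured start_urls. The sketch below is only an illustration of that comparison and does not reproduce the scraper's own link-following logic; the helper names and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve relative hrefs against the URL of the page they came from."""
    return [urljoin(base_url, href) for href in hrefs]


def outside_start_urls(links: list[str], start_urls: list[str]) -> list[str]:
    """Return the links that do not begin with any configured start URL.

    Assumption: start_urls are plain string prefixes, not regex patterns.
    """
    return [
        link
        for link in links
        if not any(link.startswith(prefix) for prefix in start_urls)
    ]
```

For example, `resolve_links("http://example.com/", ["/foo/bar/"])` gives `["http://example.com/foo/bar/"]`. Any link reported by `outside_start_urls` (for instance, one that differs in scheme or in a `www` prefix) would be worth a closer look.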