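TLDR Kevin had problems migrating a Typesense collection between Docusaurus sites on different machines. Jason explained that the export is already JSONL (one JSON object per line), that the 502 came from a reverse proxy timing out in front of the self-hosted server, and that the collection must be created from a schema before documents are imported, leading to a successful import.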
We did the following:
1.) exported the localhost typesense collection
2.) changed the Docusaurus URLs in the collection JSON file to those of the Docusaurus site in the test environment
3.) created a collection in the Typesense server in the test environment
4.) converted the collection JSON file to a JSONL file and
5.) attempted to import the JSONL file to the newly created collection on the Typesense server on the test platform.
Unfortunately, nothing was imported. Here is a sample of the error messages displayed:
`{"code":400,"document":"\"symbology\"","error":"Bad JSON: not a properly formed document.","success":false}`
`{"code":400,"document":"\"etc\"","error":"Bad JSON: not a properly formed document.","success":false}`
`{"code":400,"document":"\"etc\"","error":"Bad JSON: not a properly formed document.","success":false}`
`{"code":400,"document":"\"docs-default-current\"","error":"Bad JSON: not a properly formed document.","success":false}`
Would anyone know if this is the correct approach? Typesense is a great tool, but it does not appear to be possible to import a collection by itself, at least via curl. Or maybe this is a problem with scraped docusaurus sites? Maybe the exported JSON collection file needs to be modified in some way prior to conversion to JSONL? Typesense meets our security needs, but we do need to test it thoroughly first. Thank you all!
NOTE: If possible, we would have scraped the test Docusaurus site, but it is behind a login and password and Cloudflare Zero Trust (CF), Google Identity-Aware Proxy (IAP) and Keycloak (KC) are not used.
The format of the content exported by the documents/export endpoint is already in JSONL, so you wouldn’t need to change the format in any way. Could you share the first two lines from the JSONL file you’re trying to import into the new collection? `head -2 your-exported-documents.jsonl`
Here are the first 7:
```
"6.5"
"6.5"
"default"
{"lvl0":null,"lvl1":null,"lvl2":null,"lvl3":null,"lvl4":null,"lvl5":null,"lvl6":null}
[{"lvl0":null,"lvl1":null,"lvl2":null,"lvl3":null,"lvl4":null,"lvl5":null,"lvl6":null}]
{"lvl0":null,"lvl1":null,"lvl2":null,"lvl3":null,"lvl4":null,"lvl5":null}
{"lvl0":null,"lvl1":null,"lvl2":null,"lvl3":null,"lvl4":null,"lvl5":null}
```
In retrospect, I suppose the structure of this file does seem illogical.
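A quick way to see why a file like this gets rejected is to check each line on its own. The sketch below is only an illustration (the function name `find_bad_lines` and the sample lines are made up for this example): it flags every line that is not a standalone JSON object, which is what the import expects per line.

```python
import json


def find_bad_lines(lines):
    """Return (line_number, reason) for every line that is not a JSON object."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines carry no document
        try:
            value = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(value, dict):
            kind = type(value).__name__
            problems.append((number, f"expected an object, got {kind}"))
    return problems


# Hypothetical sample shaped like the lines shown above.
sample = ['"6.5"', '"default"', '{"lvl0": null}', '[{"lvl0": null}]']
for number, reason in find_bad_lines(sample):
    print(number, reason)
```

To check a real file, the same function can be fed an open file handle, for example `find_bad_lines(open("your-exported-documents.jsonl", encoding="utf-8"))`.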
After giving the exported JSON file the extension '.jsonl', I was able to initiate an import. Everything processed OK for 30 seconds or so, but then the message 502 Bad Gateway appeared.
Are you using curl to import the JSONL file? Or did you use the Cloud web interface? For large files you want to use the API
I am using curl. The file is 12 MB.
The 502 problem has gone away after we re-installed the Typesense server. Now when I attempt to create a collection through an import, the screen returns the following:
```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.4M    0    24  100 11.4M      1   817k  0:00:14  0:00:14 --:--:--  253k
{"message": "Not Found"}
```
The `11.4M` in the screen output corresponds to the size of the JSONL file.
Note that I ran curl on bash, if that matters.
Oh wait, I forgot that you’re self-hosting. The 502 happens if the gateway / reverse proxy you have in front of Typesense terminates the connection before the import is fully done. So you want to increase that timeout, to as high as, say, 30 minutes.
The Not Found issue is separate - you need to first create the collection before importing documents into it.
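On the 502 side, raising the proxy timeout is the fix suggested here. A complementary option, which is not something proposed in this thread, is to send the JSONL in smaller batches so that each import request finishes quickly. A minimal sketch, assuming the documents are already one JSON object per line (the helper name `jsonl_batches` and the batch size are hypothetical):

```python
from itertools import islice


def jsonl_batches(lines, batch_size):
    """Yield request bodies, each holding at most batch_size JSONL lines."""
    # Drop blank lines and trailing newlines before grouping.
    cleaned = (line.rstrip("\r\n") for line in lines if line.strip())
    while True:
        batch = list(islice(cleaned, batch_size))
        if not batch:
            return
        yield "\n".join(batch)


# Hypothetical documents; a real run would iterate over the exported file.
docs = ['{"id": "1"}\n', '{"id": "2"}\n', '\n', '{"id": "3"}\n']
for body in jsonl_batches(docs, batch_size=2):
    print(repr(body))
```

Each yielded body would then be sent as its own import request to the collection, after the collection has been created.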
OK, to create the collection, does that mean I need to specify a schema for it as well?
Thanks!
Correct
OK, is it possible to export the schema from an existing collection? Thanks!
With, of course, the intention of using the exported schema to create a new collection on a different server.
Yup:
You can then pass the output JSON of that endpoint to the create collection endpoint:
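For reference, the order of operations described here (retrieve the schema from the old server, create the collection on the new one, then import) could be sketched in Python as below. This is a sketch under assumptions, not the exact commands from the thread: the host names, the API keys and the collection name `docs` are placeholders, and the functions only build the requests without sending them.

```python
import json
import urllib.request

API_KEY_HEADER = "X-TYPESENSE-API-KEY"


def retrieve_schema_request(base_url, api_key, collection):
    """GET /collections/<name>: returns the collection schema as JSON."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/collections/{collection}",
        headers={API_KEY_HEADER: api_key},
        method="GET",
    )


def create_collection_request(base_url, api_key, schema):
    """POST /collections: creates a collection from a schema dict."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/collections",
        data=json.dumps(schema).encode("utf-8"),
        headers={API_KEY_HEADER: api_key, "Content-Type": "application/json"},
        method="POST",
    )


def import_documents_request(base_url, api_key, collection, jsonl_text):
    """POST /collections/<name>/documents/import with a JSONL body."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/collections/{collection}/documents/import?action=create",
        data=jsonl_text.encode("utf-8"),
        headers={API_KEY_HEADER: api_key, "Content-Type": "text/plain"},
        method="POST",
    )


# Placeholder values: replace with the real servers, keys and collection name.
schema = {"name": "docs", "fields": [{"name": "content", "type": "string"}]}
steps = [
    retrieve_schema_request("http://localhost:8108", "local-key", "docs"),
    create_collection_request("http://typesense.test.example:8108", "test-key", schema),
    import_documents_request("http://typesense.test.example:8108", "test-key", "docs", '{"content": "6.5"}'),
]
for request in steps:
    print(request.get_method(), request.full_url)
```

Each request would be sent with `urllib.request.urlopen(request)`. In a real run the `schema` dict would be the parsed response of the first request rather than a hand-written one.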
Hi! I did as you said. I was able to export the schema, and then create a collection using the schema.
But when I attempted an import, I received lots of messages such as these:
```
{"code":400,"document":" \"current\"","error":"Bad JSON: not a properly formed document.","success":false}
{"code":400,"document":" ],","error":"Bad JSON: [json.exception.parse_error.101] parse error at line 1, column 3: syntax error while parsing value - unexpected ']'; expected '[', '{', or a literal","success":false}
{"code":400,"document":" \"weight\": {","error":"Bad JSON: [json.exception.parse_error.101] parse error at line 1, column 11: syntax error while parsing value - unexpected ':'; expected end of input","success":false}
{"code":400,"document":" \"level\": 0,","error":"Bad JSON: [json.exception.parse_error.101] parse error at line 1, column 12: syntax error while parsing value - unexpected ':'; expected end of input","success":false}
{"code":400,"document":" \"page_rank\": 0,","error":"Bad JSON: [json.exception.parse_error.101] parse error at line 1, column 16: syntax error while parsing value - unexpected ':'; expected end of input","success":false}
{"code":400,"document":" \"position\": 57,","error":"Bad JSON: [json.exception.parse_error.101] parse error at line 1, column 15: syntax error while parsing value - unexpected ':'; expected end of input","success":false}
{"code":400,"document":" \"position_descending\": 1","error":"Bad JSON: [json.exception.parse_error.101] parse error at line 1, column 26: syntax error while parsing value - unexpected ':'; expected end of input","success":
```
Any idea why?
Do I need to remove EOL characters in the JSONL file?
Could you share the first few lines of the JSONL file again?
```
{
  "content": "6.5",
  "content_camel": "6.5",
  "docusaurus_tag": "default",
  "hierarchy": {
    "lvl0": null,
    "lvl1": null,
    "lvl2": null,
    "lvl3": null,
    "lvl4": null,
    "lvl5": null,
    "lvl6": null
  },
```
Or rather this is the file exported from the other Typesense server.
A JSONL file needs to have one JSON object per line. For example:
```
{"id": "124", "company_name": "Stark Industries", "num_employees": 5215, "country": "US"}
{"id": "125", "company_name": "Future Technology", "num_employees": 1232, "country": "UK"}
{"id": "126", "company_name": "Random Corp.", "num_employees": 531, "country": "AU"}
```
This is how the Typesense export endpoint exports docs as well
It seems like there’s some additional processing you might be doing that’s outputting a formatted JSON object with line-breaks between key values
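If a file has already been pretty-printed like that, one way to get back to one object per line is to decode the JSON values one after another and re-serialize each compactly. A minimal sketch using Python's standard `json` module (the function name `to_jsonl` is made up here, and unwrapping a top-level array into separate documents is an assumption about what the file might contain):

```python
import json


def to_jsonl(text):
    """Convert back-to-back JSON values, pretty-printed or not, to JSONL."""
    decoder = json.JSONDecoder()
    position = 0
    lines = []
    while position < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while position < len(text) and text[position].isspace():
            position += 1
        if position >= len(text):
            break
        value, position = decoder.raw_decode(text, position)
        # Assumption: a top-level array is a list of documents.
        documents = value if isinstance(value, list) else [value]
        for document in documents:
            lines.append(
                json.dumps(document, ensure_ascii=False, separators=(",", ":"))
            )
    return "\n".join(lines)


# Hypothetical pretty-printed input shaped like the sample above.
pretty = '{\n  "content": "6.5",\n  "hierarchy": {\n    "lvl0": null\n  }\n}\n{\n  "content": "7.0"\n}\n'
print(to_jsonl(pretty))
```

The simpler route, as noted above, is to import the export output unmodified, since it is already in this shape.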
OK thanks
You are right. It works. I was able to import. Thanks much!
Just one follow-up question - do most of the people who use Typesense NOT use curl? In other words, do more people use one of the client libraries rather than curl to interact with the Typesense server?
Typesense Server exposes an API, and curl just calls that same API
So whether you use curl / browser / client library - it’s the same API that gets called
Kevin
Tue, 15 Aug 2023 12:44:48 UTC
Hello everyone! Typesense is awesome. We have successfully integrated Typesense with Docusaurus on localhost, where both Docusaurus and the Typesense server are running on the same machine and where the typesense/docsearch-scraper docker job has been previously run on the same machine and scraped the localhost Docusaurus site. We would like to move the collection that was created by running typesense/docsearch-scraper to a Typesense server running in the test environment but we are having problems. Details in thread.