# random
s
But the issue is that only a part of the data, about 50k documents, is loaded into the collection. And in the APIs I also see { "message": "Not Ready or Lagging" }. Can you please help with what would be the best option to load a large number of documents without having to break it down into a lot of smaller JSONL files? If I load files with about 20k documents it loads fast and smooth. Is there any server config or setting that I am missing? Below is the curl command I am sending:
curl "${TYPESENSE_HOST}/collections/AdditionalContacts/documents/import?batch_size=1000" -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -T additional_contacts_10l-1499999.jsonl | jq
k
When you send a large amount of data in a single batch, there is a back-pressure mechanism to prevent the server from getting overwhelmed, because that would affect searches.
We will have to see if we can support automatic back pressure when large imports are done, without requiring the batches to be split up the way they currently are.
s
Hi Kishore. Thank you for your response. Your inputs are helpful. Given the current state, do you suggest any best practice or sample implementation where an import of a large number of records has been done without running into the back-pressure limit? Or temporarily disabling search so the import can complete?
k
You can just set the healthy-write-lag and healthy-read-lag thresholds high so the server does not return that error
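For example, something along these lines when starting the container (a sketch: the flag names are the documented Typesense server options, but the values, image tag and data path here are only placeholders):
Copy code
docker run -d -p 8108:8108 -v /srv/typesense-data:/data typesense/typesense:0.24.0 \
  --data-dir /data --api-key=xyz \
  --healthy-write-lag=10000 --healthy-read-lag=20000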
s
Thank you very much Kishore. Will try that.
j
Here’s a one-liner I sometimes use to break apart large JSONL files into smaller batches and import them into Typesense:
Copy code
parallel --block -10 -a documents.jsonl --tmpdir /tmp --pipepart --cat 'curl -H "X-TYPESENSE-API-KEY: xyz" -X POST -T {} http://localhost:8108/collections/companies/documents/import?action=create'
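(Roughly how it works: --pipepart splits the file on disk into chunks without reading it all into memory, and --cat writes each chunk to a temp file whose path is substituted for {}, so every chunk becomes its own curl import request.)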
run 1
a
Wow, this is a great idea. If only you had suggested it a few days earlier 😅. I uploaded around 200k records and it took around 20-30 minutes 🤦.
j
Ha! I just remembered this today and dug up the snippet from my notes. I should probably add this to the docs!
In your case though, your cluster had also run out of RAM, so that’s why it took a while to index
👍 1
s
Thank you Jason. This will be very useful; if this works then I don't need to meddle with the server's default config.
👍 1
Jason & Kishore, I have one more question. What should the next action be when I get a "message": "Not Ready or Lagging" response and the API stats show "pending_write_batches": 675? I left it as is for over 12 hours but the status is still the same. In such cases what is the next step you suggest? I have been restarting the Docker container, but is there any command or restart/reset that can be done to get back to being able to load more data?
k
It should not be stuck in that state. It should progress. If it's not, then it's likely a bug. Are you able to reproduce this issue consistently?
s
Yes. It happened 3 times.
k
Are you able to share the dataset with us so I can see if I can reproduce?
s
I will try to. This is customer data so I will have to do some masking.
But at a high level, below is the structure of the data I am trying to load:
"name": "AdditionalContacts", "fields": [ {"name": "id", "type": "string" }, {"name": "kol_id", "type": "string" }, {"name": "master_customer_id", "type": "string" }, {"name": "master_customer_location_id", "type": "string" }, {"name": "title", "type": "string" ,"facet": true }, {"name": "first_name", "type": "string" }, {"name": "middle_name", "type": "string" }, {"name": "last_name", "type": "string" }, {"name": "full_name", "type": "string" }, {"name": "specialty", "type": "string" ,"facet": true}, {"name": "country_name", "type": "string" ,"facet": true}, {"name": "state_name", "type": "string" ,"facet": true}, {"name": "city_name", "type": "string" ,"facet": true}, {"name": "postal_code", "type": "string*" ,"facet": true}, {"name": "address_line_1", "type": "string" }, {"name": "npi", "type": "string" }, {"name": "customer_type", "type": "int32" } ]
Since it contains address and location data, I am not sure if it's a good idea to share it in this forum.
k
You can see if you can maybe mask data and still reproduce the issue? You can also DM me or email us privately.
How many records are you importing?
s
I will try to mask the data and send it.
k
Verify you are able to reproduce the issue after masking.
s
When I try to load up to 1 lakh (100k) records it's smooth. When I try with 5 lakh (500k) records in one JSONL file, I see this happening.
ok. will do
k
Are you sure that docker has enough memory?
s
Yes. The server has 32GB and the Docker run was created with defaults, so no limitation.
k
I think the default Docker container has only 2G memory.
You can check via the docker stats command.
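For example (the container name here is just a placeholder):
Copy code
docker stats --no-stream typesense-container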
s
Ok thanks. Will check that.
Docker Stats shows the Limit is 32GB
k
Any other error logs produced by Typesense right before the pending write batches value becomes stuck and static?
s
In the command line where the Docker container was started, this is the output, and it continues like this ...
I20230314 144414.819993 191 raft_server.h:60] Peer refresh succeeded!
E20230314 144423.821457 167 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230314 144424.821669 167 raft_server.cpp:545] Term: 2, last_index index: 888, committed_index: 888, known_applied_index: 888, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353275
I20230314 144424.821827 191 raft_server.h:60] Peer refresh succeeded!
E20230314 144432.823074 167 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230314 144434.823442 167 raft_server.cpp:545] Term: 2, last_index index: 888, committed_index: 888, known_applied_index: 888, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353275
I20230314 144434.823505 191 raft_server.h:60] Peer refresh succeeded!
is there any log you want me to pull?
k
Yes, that's the one. What happens just before that number gets stuck?
s
I will try to reproduce the issue and get the logs to you.
Here is also something interesting. I stopped the Docker container using docker stop {containerid}.
Then started the container again with docker run. Now that the Typesense instance has started, I still see the same error.
I20230315 080242.905923 194 raft_server.h:60] Peer refresh succeeded!
E20230315 080244.906180 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 080252.907020 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
I20230315 080252.907131 195 raft_server.h:60] Peer refresh succeeded!
E20230315 080253.907241 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 080302.908504 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
E20230315 080302.908579 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 080302.908634 190 raft_server.h:60] Peer refresh succeeded!
E20230315 080311.910018 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 080312.910221 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
I20230315 080312.910256 196 raft_server.h:60] Peer refresh succeeded!
E20230315 080320.911703 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 080322.912124 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
I20230315 080322.912271 194 raft_server.h:60] Peer refresh succeeded!
I20230315 080327.124313 165 batched_indexer.cpp:279] Running GC for aborted requests, req map size: 1
E20230315 080329.913832 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 080332.914208 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
it continues with the error about 675 queued writes
I think the only way out of this state is to stop the container, delete the typesense_data folder, and then start Docker again.
k
Yes, it is resuming the container's previous state. You need to start from scratch.
s
Hi Kishore. I tried various scenarios a couple of times today, and I am almost always able to reproduce the issue where the Typesense server gets stuck and probably runs in loops. Using the parallel command suggested by Jason helped to import data up to about 5 lakh (500k) records. Beyond that it kind of gets stuck and does not let additional data be added to the collection.
About your question on what happens just before it gets stuck: I was able to pull the info below. Not sure if it's useful.
I20230315 142003.943729 204 log.cpp:537] Renamed `/data/state/log/log_inprogress_00000000000000000265' to `/data/state/log/log_00000000000000000265_00000000000000000271'
I20230315 142003.943948 204 log.cpp:108] Created new segment `/data/state/log/log_inprogress_00000000000000000272' with fd=74
I20230315 142004.604780 166 raft_server.cpp:545] Term: 2, last_index index: 274, committed_index: 273, known_applied_index: 273, applying_index: 0, queued_writes: 17, pending_queue_size: 1, local_sequence: 1554680
I20230315 142004.604878 203 raft_server.h:60] Peer refresh succeeded!
I20230315 142006.574550 206 log.cpp:523] close a full segment. Current first_index: 272 last_index: 278 raft_sync_segments: 0 will_sync: 1 path: /data/state/log/log_00000000000000000272_00000000000000000278
I20230315 142006.574635 206 log.cpp:537] Renamed `/data/state/log/log_inprogress_00000000000000000272' to `/data/state/log/log_00000000000000000272_00000000000000000278'
I20230315 142006.574759 206 log.cpp:108] Created new segment `/data/state/log/log_inprogress_00000000000000000279' with fd=38
I20230315 142014.605857 166 raft_server.cpp:545] Term: 2, last_index index: 282, committed_index: 282, known_applied_index: 282, applying_index: 0, queued_writes: 18, pending_queue_size: 0, local_sequence: 1611001
I20230315 142014.605947 204 raft_server.h:60] Peer refresh succeeded!
I20230315 142024.607048 166 raft_server.cpp:545] Term: 2, last_index index: 282, committed_index: 282, known_applied_index: 282, applying_index: 0, queued_writes: 18, pending_queue_size: 0, local_sequence: 1611001
I20230315 142024.607146 208 raft_server.h:60] Peer refresh succeeded!
I20230315 142031.222904 167 batched_indexer.cpp:279] Running GC for aborted requests, req map size: 3
I20230315 142034.608918 166 raft_server.cpp:545] Term: 2, last_index index: 282, committed_index: 282, known_applied_index: 282, applying_index: 0, queued_writes: 18, pending_queue_size: 0, local_sequence: 1611001
I20230315 142034.609088 206 raft_server.h:60] Peer refresh succeeded!
I20230315 142044.610810 166 raft_server.cpp:545] Term: 2, last_index index: 282, committed_index: 282, known_applied_index: 282, applying_index: 0, queued_writes: 18, pending_queue_size: 0, local_sequence: 1611001
I20230315 142044.610908 203 raft_server.h:60] Peer refresh succeeded!
Could it be something that is related to the size of the index or other factors that might be impacting the import?
I am trying to see if I can have Typesense installed on a Linux machine directly, to rule out any Docker-related config issues that I am not able to capture or identify.
k
Those logs look fine to me. We actually have many users indexing several millions of documents effortlessly so I'm really curious to know what's going wrong here. Trying on Linux directly is a good idea.
s
Hi Kishore. I was able to install Typesense on a physical Linux system using the DEB package and then tried to import the same file. Fortunately this time it did not get stuck in that loop, but it did stop, and based on the /var/log/typesense logs it looks like a parsing issue. Below is the error:
E20230316 093448.871594 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 414: syntax error while parsing object key - invalid string: control character U+0002 (STX) must be escaped to \u0002; last read: '"address_line_1t#BC<U+0002>'; expected string literal
E20230316 093450.361044 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 91: syntax error while parsing object key - invalid string: forbidden character after backslash; last read: '"master_custom\3'; expected string literal
E20230316 093450.435761 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 229: syntax error while parsing value - invalid string: control character U+000D (CR) must be escaped to \u000D or \r; last read: '"PURNe|<U+000D>'
E20230316 093450.513315 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 449: syntax error while parsing object key - invalid string: control character U+0010 (DLE) must be escaped to \u0010; last read: '"customer_s*V<U+0010>'; expected string literal
E20230316 093450.571502 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 477: syntax error while parsing object separator - unexpected ','; expected ':'
E20230316 093450.623791 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 156: syntax error while parsing object key - invalid string: control character U+0018 (CAN) must be escaped to \u0018; last read: '"first_namYs'w<U+0018>'; expected string literal
E20230316 093451.012920 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 345: syntax error while parsing value - invalid literal; last read: '"city_name":>'
E20230316 093451.444113 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 400: syntax error while parsing object key - invalid string: control character U+0013 (DC3) must be escaped to \u0013; last read: '"address_line<U+0013>'; expected string literal
I20230316 093458.492647 1284 raft_server.cpp:545] Term: 5, last_index index: 348, committed_index: 348, known_applied_index: 348, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 1745884
I20230316 093458.492774 1320 raft_server.h:60] Peer refresh succeeded!
This data came from a structured MariaDB database. It was exported to a CSV and then converted to JSONL using mlr, as suggested in your docs. I am not sure where the special chars got inserted. That said, based on the above errors, is there an easy way to find the lines/rows of the docs where these special chars are? e.g. U+0002 (STX), U+0013 (DC3), etc.
Or do you recommend using coerce_or_drop and hoping it takes care of indexing the other records apart from the ones with issues?
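One way to locate such lines, assuming GNU grep with PCRE support (-P) and a placeholder file name, is a sketch like this; it prints the line numbers of any JSONL line containing an unescaped ASCII control character:
Copy code
grep -nP '[\x00-\x08\x0B-\x1F]' additional_contacts.jsonl | head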
k
What client are you using to import?
Each document in the import file should have an associated error message in API response.
s
shell.
k
Nothing in curl response?
s
Nope. It abruptly stopped.
{"success":true}
{"success":true}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7100k    0  244k  100 6855k  26994    739k  0:00:09  0:00:09 --:--:-- 61596
ETA: 160s Left: 3 AVG: 58.22s  local:3/37/100%/58.3s
Hence the only logs I found were in the Typesense log.
E.g. here is an error from the logs:
E20230316 093434.579699 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 473: syntax error while parsing value - invalid string: control character U+0006 (ACK) must be escaped to \u0006; last read: '"*N0222`**e<U+0006>'
When I search for the value N0222 in the source JSONL file, below is the record I find, and I don't see any issue with the data.
Copy code
{
  "id": "286452",
  "kol_id": "62930",
  "master_customer_id": "MC0001242543",
  "master_customer_location_id": "MCL1015119634",
  "title": "NUR",
  "suffix": "",
  "first_name": "Charlotte",
  "middle_name": "F",
  "last_name": "Hinson",
  "full_name": "Hinson, Charlotte F",
  "specialty": "",
  "org_name": "",
  "country_name": "United States",
  "state_name": "Florida",
  "city_name": "Daytona Beach",
  "postal_code": "32114",
  "address_line_1": "551 NATIONAL HEALTH CARE DR",
  "npi": "N022237137",
  "customer_type": 0
}
k
Does the curl complete? Ideally you should see success: false and error message in curl output
Maybe isolate that record without copy pasting directly (delete above and below) and then try importing.
s
This is how the curl/parallel import job ends. No error. Hence I looked in the logs for any clues about why it stopped.
k
Try only the file with that apparent bad record?
Without parallel
s
ok. Will try that.
Tried without parallel. It stopped exactly at the same point, and curl did not show any error. The logs show a parse error.
Here is the log file error:
I20230316 145710.372087 1320 raft_server.h:60] Peer refresh succeeded!
E20230316 145719.214854 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 224: syntax error while parsing object key - invalid string: control character U+0000 (NUL) must be escaped to \u0000; last read: '"full_nam<U+0000>'; expected string literal
I20230316 145720.373448 1284 raft_server.cpp:545] Term: 5, last_index index: 647, committed_index: 647, known_applied_index: 647, applying_index: 0, queued_writes: 132, pending_queue_size: 0, local_sequence: 3796967
I20230316 145720.373706 1320 raft_server.h:60] Peer refresh succeeded!
E20230316 145722.755046 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 336: syntax error while parsing object separator - invalid literal; last read: '"state_name"='; expected ':'
E20230316 145722.910374 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 310: syntax error while parsing object - invalid literal; last read: '"United States"<U+0017>'; expected '}'
E20230316 145724.886736 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 100: syntax error while parsing object key - invalid string: control character U+001F (US) must be escaped to \u001F; last read: '"master_customer_locatio-<U+001F>'; expected string literal
E20230316 145726.560950 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 446: syntax error while parsing object key - invalid string: control character U+001E (RS) must be escaped to \u001E; last read: '"npiw'<U+001E>'; expected string literal
E20230316 145727.054530 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 99: syntax error while parsing object key - invalid string: control character U+0017 (ETB) must be escaped to \u0017; last read: '"master_customer_locatio<U+0017>'; expected string literal
k
Good, that's progress! With this, I think you can further narrow down into the bad record. Once you do that, if you can share that file (maybe masking out other text if needed), then I can try and see why we are not failing gracefully.
s
great. Thanks
Here is some quick analysis based on the error:
Error: E20230316 145735.250784 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 457: syntax error while parsing value - invalid string: control character U+000E (SO) must be escaped to \u000E; last read: '"3602 MATLOCKXVM@h.Gʅ<U+000E>'
k
Btw, you can send the ?return_id=true parameter to import so that the success line also has the id of the document being imported -- this way we can see where it stops.
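For example, a sketch reusing the earlier import command (the only point here is the extra return_id query parameter; the file name is a placeholder):
Copy code
curl "${TYPESENSE_HOST}/collections/AdditionalContacts/documents/import?batch_size=1000&return_id=true" \
  -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -T additional_contacts.jsonl | jq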
s
I did not; will try that today.
k
But it seems like the error happens with the record containing the string 3602 MATLOCKXVM
s
I opened the file in Notepad++ and searched for 3602 MATLOCK. I found the below 2 records, but they seem to be fine, and I don't see any special chars.
Records
Copy code
{
  "id": "472833",
  "kol_id": "49833",
  "master_customer_id": "MC0000045163",
  "master_customer_location_id": "MCL1015798917",
  "title": "MD",
  "suffix": "",
  "first_name": "Julie",
  "middle_name": "Ross",
  "last_name": "Pittman",
  "full_name": "Pittman, Julie Ross",
  "specialty": "ADDICTION PSYCHIATRY",
  "org_name": "",
  "country_name": "United States",
  "state_name": "Texas",
  "city_name": "Arlington",
  "postal_code": "76015",
  "address_line_1": "3602 MATLOCK RD STE 210",
  "npi": "NULL",
  "customer_type": 0
}
--
Copy code
{
  "id": "754264",
  "kol_id": "399761",
  "master_customer_id": "MC0000238626",
  "master_customer_location_id": "NULL",
  "title": "",
  "suffix": "",
  "first_name": "HAMID",
  "middle_name": "",
  "last_name": "BURNEY",
  "full_name": "BURNEY, HAMID ",
  "specialty": "NULL",
  "org_name": "",
  "country_name": "United States",
  "state_name": "Texas",
  "city_name": "ARLINGTON",
  "postal_code": "76015",
  "address_line_1": "3602 MATLOCK RD",
  "npi": "1780630897",
  "customer_type": 0
}
k
Most text editors can handle bad unicode, so you won't be able to see it with the naked eye. And often copy pasting will also "fix" the issue.
s
oh ok.
Got it.
k
So the trick is to take the original file, delete the lines before and after the suspected problematic record (keeping a buffer of a few records on either side), save that file, and try with that.
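For example, a sketch along these lines (the line numbers are hypothetical; adjust them so they bracket the suspect record with a few lines on either side):
Copy code
sed -n '4990,5010p' additional_contacts.jsonl > suspect_window.jsonl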
s
Cool. Will try that approach. Thanks again Kishore. You have been very helpful.
👍 1
k
Hi @Suraj Prabhu, any luck in narrowing this down?
s
Hi @Kishore Nallan. Looks like it was a data quality issue. I had the team extract the data again, and this time I was able to convert the CSV to JSONL using mlr and load 28 lakh (2.8 million) records in 3 batches: 10 lakh, 5 lakh and 13 lakh. All uploads were smooth and fast too. I did not even have to use the parallel load option. Thank you again for all your help. This is on the version of Typesense I installed on the Ubuntu server; the Docker version I still have to try loading with the fresh data. But my current setup is good enough for me to continue more R&D before we plan to use it with a customer instance.
k
Good to hear, thanks for the update.