
Large JSONL Documents Import Issue & Resolution

TL;DR: Suraj was having trouble loading large JSONL files into a Typesense server. After several discussions and attempts, the issue turned out to be data quality. Once the team extracted the data again, the upload process worked smoothly.

Mar 14, 2023 (9 months ago)
Suraj
02:48 PM
But the issue is that only a part of the data, about 50k documents, gets loaded into the collection. And in the API responses I also see
{
"message": "Not Ready or Lagging"
}
Can you please help with the best way to load a large number of documents without having to break them down into lots of smaller JSONL files? If I load files with about 20k documents, the import is fast and smooth.

Is there any server config or setting that I am missing? Below is the curl command I am sending:

curl "${TYPESENSE_HOST}/collections/AdditionalContacts/documents/import?batch_size=1000" -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -T additional_contacts_10l-1499999.jsonl | jq
Kishore Nallan
02:50 PM
When you send a large amount of data in a single batch, there is a back pressure mechanism to prevent the server from getting overwhelmed, because that would affect searches.
02:50
Kishore Nallan
02:50 PM
We will have to see if we can support automatic back pressure for large imports, without requiring the batches to be split up the way they currently are.
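In the meantime, one possible client-side workaround is to pause between batches until the server reports healthy again. A minimal sketch, assuming the same ${TYPESENSE_HOST} placeholder as above:

# Wait until the node reports healthy before sending the next batch.
# The /health endpoint returns {"ok":true} once the server has caught up.
until curl -s "${TYPESENSE_HOST}/health" | grep -q '"ok":true'; do
  echo "server busy, waiting..." >&2
  sleep 5
done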
Suraj
02:53 PM
Hi Kishore. Thank you for your response; your inputs are helpful. As things stand, is there any best practice or sample implementation you'd suggest for importing a large number of records without triggering the back pressure? Or for temporarily disabling search so the import can complete?
Kishore Nallan
02:54 PM
You can just set the healthy-write-lag and healthy-read-lag thresholds high so that the server does not return that error.
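These thresholds are server startup flags. A minimal sketch of passing them via Docker, assuming the official image; the version tag and values are illustrative:

# Defaults are healthy-read-lag=1000 and healthy-write-lag=500 queued writes;
# raising them keeps bulk imports from tripping "Not Ready or Lagging".
docker run -p 8108:8108 -v"$(pwd)"/typesense_data:/data typesense/typesense:0.24.0 \
  --data-dir /data --api-key="${TYPESENSE_API_KEY}" \
  --healthy-read-lag=10000 --healthy-write-lag=5000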
Suraj
02:57 PM
Thank you very much, Kishore. Will try that.
Jason
03:21 PM
Here’s a one-liner I sometimes use to break apart large JSONL files into smaller batches and import them into Typesense:

parallel --block -10 -a documents.jsonl --tmpdir /tmp --pipepart --cat 'curl -H "X-TYPESENSE-API-KEY: xyz" -X POST -T {} "${TYPESENSE_HOST}/collections/collection_name/documents/import"'
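If GNU parallel is not available, a rough sequential equivalent using split; the collection name and chunk size here are placeholders:

# Break the file into ~20k-line chunks and import them one at a time.
split -l 20000 documents.jsonl chunk_
for f in chunk_*; do
  curl "${TYPESENSE_HOST}/collections/collection_name/documents/import?batch_size=1000" \
    -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -T "$f"
  echo    # separate each chunk's response lines
done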
Atul
05:03 PM
Wow, this is a great idea. If only you had suggested this a few days earlier 😅.
I uploaded around 200k records and it took around 20-30 minutes 🤦.
Jason
05:06 PM
Ha! I just remembered this today and dug up the snippet from my notes. I should probably add this to the docs!
05:07
Jason
05:07 PM
In your case though, your cluster had also run out of RAM, so that’s why it took a while to index

Suraj
05:53 PM
Thank you, Jason. This will be very useful; if this works then I don't need to meddle with the server's default config.

Mar 15, 2023 (9 months ago)
Suraj
07:38 AM
Jason & Kishore,
I have one more question. What should be the next action when I get "message": "Not Ready or Lagging" and the API stats show "pending_write_batches": 675? I left it as is for over 12 hours, but the status is still the same.

In such cases, what next step do you suggest? I have been restarting the Docker container. But is there any command or restart/reset that can get the server back to a state where it can accept more data?
Kishore Nallan
07:40 AM
It should not be stuck in that state. It should progress. If it's not, then it's likely a bug. Are you able to reproduce this issue consistently?
Suraj
07:41 AM
Yes. It happened 3 times.
Kishore Nallan
07:42 AM
Are you able to share the dataset with us so I can see if I can reproduce?
Suraj
07:43 AM
I will try to. This is customer data, so I will have to do some masking.
07:43
Suraj
07:43 AM
But at a high level, below is the structure of the data I am trying to load:
07:43
Suraj
07:43 AM
"name": "AdditionalContacts",
"fields": [
{"name": "id", "type": "string" },
{"name": "kol_id", "type": "string" },
{"name": "master_customer_id", "type": "string" },
{"name": "master_customer_location_id", "type": "string" },
{"name": "title", "type": "string" ,"facet": true },
{"name": "first_name", "type": "string" },
{"name": "middle_name", "type": "string" },
{"name": "last_name", "type": "string" },
{"name": "full_name", "type": "string" },
{"name": "specialty", "type": "string" ,"facet": true},
{"name": "country_name", "type": "string" ,"facet": true},
{"name": "state_name", "type": "string" ,"facet": true},
{"name": "city_name", "type": "string" ,"facet": true},
{"name": "postal_code", "type": "string" ,"facet": *true},
{"name": "address_line_1", "type": "string" },
{"name": "npi", "type": "string" },
{"name": "customer_type", "type": "int32" }
]
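For reference, a schema like this is registered against the collections endpoint. A sketch, assuming the JSON above is saved as schema.json:

# Create the collection from the schema file.
curl "${TYPESENSE_HOST}/collections" -X POST \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @schema.json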
07:44
Suraj
07:44 AM
Since it contains addresses and locations, I am not sure it's a good idea to share it in this forum.
Kishore Nallan
07:49 AM
Can you see if you can mask the data and still reproduce the issue? You can also DM me or email us privately.
07:49
Kishore Nallan
07:49 AM
How many records are you importing?
Suraj
07:50 AM
I will try to mask the data and send it.
Kishore Nallan
07:50 AM
Verify you are able to reproduce the issue after masking.
Suraj
07:50 AM
When I try to load up to 1 lakh (100k) records it's smooth. When I try with 5 lakh (500k) records in one JSONL file, I see this happening.
07:51
Suraj
07:51 AM
ok. will do
Kishore Nallan
07:51 AM
Are you sure that Docker has enough memory?
Suraj
07:51 AM
Yes. The server has 32GB, and the container was started with docker run defaults, so no limit was set.
Kishore Nallan
07:52 AM
I think a default Docker container has only 2G of memory.
07:52
Kishore Nallan
07:52 AM
You can check via the docker stats command.
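For example, a one-shot check of each container's memory usage and limit; the format string is optional and just narrows the output to the memory columns:

docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"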
Suraj
07:52 AM
Ok thanks. Will check that.
07:55
Suraj
07:55 AM
Docker Stats shows the Limit is 32GB
Kishore Nallan
07:56 AM
Any other error logs produced by Typesense right before the pending write batches value becomes stuck and static?
Suraj
07:58 AM
In the command line where the Docker container was started, this is the output, and it continues like this...

I20230314 14:44:14.819993 191 raft_server.h:60] Peer refresh succeeded!
E20230314 14:44:23.821457 167 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230314 14:44:24.821669 167 raft_server.cpp:545] Term: 2, last_index index: 888, committed_index: 888, known_applied_index: 888, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353275
I20230314 14:44:24.821827 191 raft_server.h:60] Peer refresh succeeded!
E20230314 14:44:32.823074 167 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230314 14:44:34.823442 167 raft_server.cpp:545] Term: 2, last_index index: 888, committed_index: 888, known_applied_index: 888, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353275
I20230314 14:44:34.823505 191 raft_server.h:60] Peer refresh succeeded!
07:59
Suraj
07:59 AM
Is there any log you want me to pull?
Kishore Nallan
08:02 AM
Yes, that's the one. What happens just before that number gets stuck?
Suraj
08:02 AM
I will try to reproduce the issue and get the logs to you.
08:04
Suraj
08:04 AM
Here is also something interesting:
I stopped the Docker container using docker stop {containerid}
08:04
Suraj
08:04 AM
Then I started the container again with docker run. Now that the Typesense instance has restarted, I still see the same error.
08:05
Suraj
08:05 AM
I20230315 08:02:42.905923 194 raft_server.h:60] Peer refresh succeeded!
E20230315 08:02:44.906180 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 08:02:52.907020 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
I20230315 08:02:52.907131 195 raft_server.h:60] Peer refresh succeeded!
E20230315 08:02:53.907241 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 08:03:02.908504 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
E20230315 08:03:02.908579 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 08:03:02.908634 190 raft_server.h:60] Peer refresh succeeded!
E20230315 08:03:11.910018 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 08:03:12.910221 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
I20230315 08:03:12.910256 196 raft_server.h:60] Peer refresh succeeded!
E20230315 08:03:20.911703 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 08:03:22.912124 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
I20230315 08:03:22.912271 194 raft_server.h:60] Peer refresh succeeded!
I20230315 08:03:27.124313 165 batched_indexer.cpp:279] Running GC for aborted requests, req map size: 1
E20230315 08:03:29.913832 164 raft_server.cpp:635] 675 queued writes > healthy write lag of 500
I20230315 08:03:32.914208 164 raft_server.cpp:545] Term: 3, last_index index: 906, committed_index: 906, known_applied_index: 906, applying_index: 0, queued_writes: 675, pending_queue_size: 0, local_sequence: 353326
08:05
Suraj
08:05 AM
It continues with the error about 675 queued writes.
08:06
Suraj
08:06 AM
I think the only way to exit this state is to stop the container, delete the typesense_data folder, and then start Docker again.
Kishore Nallan
08:11 AM
Yes, a restart just resumes the container's saved state. You need to start from scratch.
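A sketch of that from-scratch reset, assuming the data dir is bind-mounted at ./typesense_data and the container is named typesense (both placeholders):

# WARNING: this wipes all indexed data so the node starts fresh.
docker stop typesense && docker rm typesense
rm -rf ./typesense_data/*
docker run -d --name typesense -p 8108:8108 \
  -v"$(pwd)"/typesense_data:/data typesense/typesense:0.24.0 \
  --data-dir /data --api-key="${TYPESENSE_API_KEY}"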
Suraj
02:37 PM
Hi Kishore. I tried various scenarios a couple of times today, and I am almost always able to reproduce the issue where the Typesense server gets stuck and probably runs in loops. Using the parallel command suggested by Jason helped to import up to about 5 lakh (500k) records. Beyond that it gets stuck and does not let additional data be added to the collection.
02:38
Suraj
02:38 PM
About your question on what happens just before it gets stuck: I was able to pull the info below. Not sure if it's useful.
02:39
Suraj
02:39 PM
I20230315 14:20:03.943729 204 log.cpp:537] Renamed /data/state/log/log_inprogress_00000000000000000265' to /data/state/log/log_00000000000000000265_00000000000000000271'
I20230315 14:20:03.943948 204 log.cpp:108] Created new segment `/data/state/log/log_inprogress_00000000000000000272' with fd=74
I20230315 14:20:04.604780 166 raft_server.cpp:545] Term: 2, last_index index: 274, committed_index: 273, known_applied_index: 273, applying_index: 0, queued_writes: 17, pending_queue_size: 1, local_sequence: 1554680
I20230315 14:20:04.604878 203 raft_server.h:60] Peer refresh succeeded!
I20230315 14:20:06.574550 206 log.cpp:523] close a full segment. Current first_index: 272 last_index: 278 raft_sync_segments: 0 will_sync: 1 path: /data/state/log/log_00000000000000000272_00000000000000000278
I20230315 14:20:06.574635 206 log.cpp:537] Renamed /data/state/log/log_inprogress_00000000000000000272' to /data/state/log/log_00000000000000000272_00000000000000000278'
I20230315 14:20:06.574759 206 log.cpp:108] Created new segment `/data/state/log/log_inprogress_00000000000000000279' with fd=38
I20230315 14:20:14.605857 166 raft_server.cpp:545] Term: 2, last_index index: 282, committed_index: 282, known_applied_index: 282, applying_index: 0, queued_writes: 18, pending_queue_size: 0, local_sequence: 1611001
I20230315 14:20:14.605947 204 raft_server.h:60] Peer refresh succeeded!
I20230315 14:20:24.607048 166 raft_server.cpp:545] Term: 2, last_index index: 282, committed_index: 282, known_applied_index: 282, applying_index: 0, queued_writes: 18, pending_queue_size: 0, local_sequence: 1611001
I20230315 14:20:24.607146 208 raft_server.h:60] Peer refresh succeeded!
I20230315 14:20:31.222904 167 batched_indexer.cpp:279] Running GC for aborted requests, req map size: 3
I20230315 14:20:34.608918 166 raft_server.cpp:545] Term: 2, last_index index: 282, committed_index: 282, known_applied_index: 282, applying_index: 0, queued_writes: 18, pending_queue_size: 0, local_sequence: 1611001
I20230315 14:20:34.609088 206 raft_server.h:60] Peer refresh succeeded!
I20230315 14:20:44.610810 166 raft_server.cpp:545] Term: 2, last_index index: 282, committed_index: 282, known_applied_index: 282, applying_index: 0, queued_writes: 18, pending_queue_size: 0, local_sequence: 1611001
I20230315 14:20:44.610908 203 raft_server.h:60] Peer refresh succeeded!
02:41
Suraj
02:41 PM
Could it be something that is related to the size of the index or other factors that might be impacting the import?
02:42
Suraj
02:42 PM
I am trying to see if I can install Typesense directly on a Linux machine, to rule out any Docker-related config issues that I am not able to capture or identify.
Kishore Nallan
02:44 PM
Those logs look fine to me. We actually have many users indexing several millions of documents effortlessly so I'm really curious to know what's going wrong here. Trying on Linux directly is a good idea.
Mar 16, 2023 (9 months ago)
Suraj
02:05 PM
Hi Kishore. I was able to install Typesense on a physical Linux system using the DEB package and then tried to import the same file. Fortunately, this time it did not get stuck in that loop. But it did stop, and based on the /var/log/typesense logs it looks like a parsing issue. Below is the error:
02:06
Suraj
02:06 PM
E20230316 09:34:48.871594 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 414: syntax error while parsing object key - invalid string: control character U+0002 (STX) must be escaped to \u0002; last read: '"address_line_1t#BC<U+0002>'; expected string literal
E20230316 09:34:50.361044 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 91: syntax error while parsing object key - invalid string: forbidden character after backslash; last read: '"master_custom\3'; expected string literal
E20230316 09:34:50.435761 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 229: syntax error while parsing value - invalid string: control character U+000D (CR) must be escaped to \u000D or \r; last read: '"PURNe|<U+000D>'
E20230316 09:34:50.513315 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 449: syntax error while parsing object key - invalid string: control character U+0010 (DLE) must be escaped to \u0010; last read: '"customer_s*V<U+0010>'; expected string literal
E20230316 09:34:50.571502 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 477: syntax error while parsing object separator - unexpected ','; expected ':'
E20230316 09:34:50.623791 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 156: syntax error while parsing object key - invalid string: control character U+0018 (CAN) must be escaped to \u0018; last read: '"first_namYs'w<U+0018>'; expected string literal
E20230316 09:34:51.012920 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 345: syntax error while parsing value - invalid literal; last read: '"city_name":>'
E20230316 09:34:51.444113 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 400: syntax error while parsing object key - invalid string: control character U+0013 (DC3) must be escaped to \u0013; last read: '"address_line<U+0013>'; expected string literal
I20230316 09:34:58.492647 1284 raft_server.cpp:545] Term: 5, last_index index: 348, committed_index: 348, known_applied_index: 348, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 1745884
I20230316 09:34:58.492774 1320 raft_server.h:60] Peer refresh succeeded!
02:09
Suraj
02:09 PM
This data came from a structured MariaDB database. It was exported to a CSV and then converted to JSONL using mlr, as suggested in your docs. I am not sure where the special chars got inserted. That said, based on the above errors, is there an easy way to find the lines/rows of the file where these special chars are? e.g. U+0002 (STX), U+0013 (DC3), etc.
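For context, the CSV-to-JSONL conversion described here is along these lines; a sketch using Miller 6 syntax, with assumed filenames:

# Convert the MariaDB CSV export to JSON Lines with Miller.
mlr --icsv --ojsonl cat additional_contacts.csv > additional_contacts.jsonl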
02:10
Suraj
02:10 PM
Or do you recommend using coerce_or_drop and hoping it takes care of indexing the other records, apart from the ones with issues?
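One way to locate such rows is to scan for raw control characters directly. A sketch that needs GNU grep for -P; newline and carriage return are left out of the class so ordinary line endings don't match, and -a forces text mode even when NUL bytes are present:

# Print the line numbers of rows containing unescaped ASCII control characters.
grep -naP '[\x00-\x08\x0B\x0C\x0E-\x1F]' additional_contacts.jsonl | cut -d: -f1

And if dropping bad rows is acceptable, the import endpoint also accepts a dirty_values query parameter (e.g. dirty_values=coerce_or_drop).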
Kishore Nallan
02:28 PM
What client are you using to import?
02:28
Kishore Nallan
02:28 PM
Each document in the import file should have an associated error message in API response.
Suraj
02:28 PM
shell.
Kishore Nallan
02:29 PM
Nothing in curl response?
Suraj
02:30 PM
Nope. It abruptly stopped.
02:30
Suraj
02:30 PM
{"success":true}
{"success":true} % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7100k 0 244k 100 6855k 26994 739k 0:00:09 0:00:09 --:--:-- 61596
ETA: 160s Left: 3 AVG: 58.22s local:3/37/100%/58.3s
02:30
Suraj
02:30 PM
Hence the only logs I found were in the Typesense log.
02:41
Suraj
02:41 PM
E.g., here is an error from the logs:
E20230316 09:34:34.579699 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 473: syntax error while parsing value - invalid string: control character U+0006 (ACK) must be escaped to \u0006; last read: '"N0222`*e<U+0006>'
02:42
Suraj
02:42 PM
When I search for the value N0222 in the source JSONL file, below is the record I find, and I don't see any issue with the data.
02:42
Suraj
02:42 PM
{"id": "286452", "kol_id": "62930", "master_customer_id": "MC0001242543", "master_customer_location_id": "MCL1015119634", "title": "NUR", "suffix": "", "first_name": "Charlotte", "middle_name": "F", "last_name": "Hinson", "full_name": "Hinson, Charlotte F", "specialty": "", "org_name": "", "country_name": "United States", "state_name": "Florida", "city_name": "Daytona Beach", "postal_code": "32114", "address_line_1": "551 NATIONAL HEALTH CARE DR", "npi": "N022237137", "customer_type": 0}
Kishore Nallan
02:45 PM
Does the curl complete? Ideally you should see success: false and an error message in the curl output.
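For reference, the import endpoint streams back one JSON line per input document, so a failure normally surfaces like this (angle brackets mark placeholders):

{"success":true}
{"success":false,"error":"<parse error from the server>","document":"<the raw offending line>"}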
02:46
Kishore Nallan
02:46 PM
Maybe isolate that record without copy-pasting it directly (delete the lines above and below) and then try importing.
Suraj
02:47 PM
This is how the curl parallel import job ends. No error. Hence I looked in the logs for clues about why it stopped.
Kishore Nallan
02:49 PM
Try only the file with that apparent bad record?
02:49
Kishore Nallan
02:49 PM
Without parallel
Suraj
02:50 PM
ok. Will try that.
03:31
Suraj
03:31 PM
Tried without parallel. It stopped at exactly the same point, and curl did not report any error. The logs show a parse error.
03:32
Suraj
03:32 PM
Log file error:
03:32
Suraj
03:32 PM
I20230316 14:57:10.372087 1320 raft_server.h:60] Peer refresh succeeded!
E20230316 14:57:19.214854 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 224: syntax error while parsing object key - invalid string: control character U+0000 (NUL) must be escaped to \u0000; last read: '"full_nam<U+0000>'; expected string literal
I20230316 14:57:20.373448 1284 raft_server.cpp:545] Term: 5, last_index index: 647, committed_index: 647, known_applied_index: 647, applying_index: 0, queued_writes: 132, pending_queue_size: 0, local_sequence: 3796967
I20230316 14:57:20.373706 1320 raft_server.h:60] Peer refresh succeeded!
E20230316 14:57:22.755046 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 336: syntax error while parsing object separator - invalid literal; last read: '"state_name"='; expected ':'
E20230316 14:57:22.910374 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 310: syntax error while parsing object - invalid literal; last read: '"United States"<U+0017>'; expected '}'
E20230316 14:57:24.886736 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 100: syntax error while parsing object key - invalid string: control character U+001F (US) must be escaped to \u001F; last read: '"master_customer_locatio-<U+001F>'; expected string literal
E20230316 14:57:26.560950 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 446: syntax error while parsing object key - invalid string: control character U+001E (RS) must be escaped to \u001E; last read: '"npiw'<U+001E>'; expected string literal
E20230316 14:57:27.054530 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 99: syntax error while parsing object key - invalid string: control character U+0017 (ETB) must be escaped to \u0017; last read: '"master_customer_locatio<U+0017>'; expected string literal
Kishore Nallan
03:34 PM
Good, that's progress! With this, I think you can narrow down to the bad record. Once you do that, if you can share that file (maybe masking other text if needed), then I can try to see why we are not failing gracefully.
Suraj
03:34 PM
great. Thanks
03:35
Suraj
03:35 PM
Here is some quick analysis based on the error:
03:35
Suraj
03:35 PM
Error: E20230316 14:57:35.250784 1287 collection.cpp:74] JSON error: [json.exception.parse_error.101] parse error at line 1, column 457: syntax error while parsing value - invalid string: control character U+000E (SO) must be escaped to \u000E; last read: '"3602 [email protected]ʅ<U+000E>'
Kishore Nallan
03:35 PM
Btw, you can send the ?return_id=true parameter to the import so that each success line also has the id of the document being imported -- this way we can see where it stops.
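A sketch of the same import call with that parameter; the filename is a placeholder:

curl "${TYPESENSE_HOST}/collections/AdditionalContacts/documents/import?batch_size=1000&return_id=true" \
  -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -T additional_contacts.jsonl
# Each response line then looks like {"success":true,"id":"286452"},
# so the last id printed shows roughly where the import stopped.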
Suraj
03:36 PM
I did not; will try that today.
Kishore Nallan
03:36 PM
But it seems like the error happens with the record containing the string 3602 MATLOCKXVM.
Suraj
03:36 PM
I opened the file in Notepad++ and searched for 3602 MATLOCK. I found the 2 records below, but they seem to be fine, and I don't see any special chars.
03:37
Suraj
03:37 PM
Records
03:37
Suraj
03:37 PM
{"id": "472833", "kol_id": "49833", "master_customer_id": "MC0000045163", "master_customer_location_id": "MCL1015798917", "title": "MD", "suffix": "", "first_name": "Julie", "middle_name": "Ross", "last_name": "Pittman", "full_name": "Pittman, Julie Ross", "specialty": "ADDICTION PSYCHIATRY", "org_name": "", "country_name": "United States", "state_name": "Texas", "city_name": "Arlington", "postal_code": "76015", "address_line_1": "3602 MATLOCK RD STE 210", "npi": "NULL", "customer_type": 0}
03:37
Suraj
03:37 PM
--
03:37
Suraj
03:37 PM
{"id": "754264", "kol_id": "399761", "master_customer_id": "MC0000238626", "master_customer_location_id": "NULL", "title": "", "suffix": "", "first_name": "HAMID", "middle_name": "", "last_name": "BURNEY", "full_name": "BURNEY, HAMID ", "specialty": "NULL", "org_name": "", "country_name": "United States", "state_name": "Texas", "city_name": "ARLINGTON", "postal_code": "76015", "address_line_1": "3602 MATLOCK RD", "npi": "1780630897", "customer_type": 0}
Kishore Nallan
03:37 PM
Most text editors can handle bad unicode -- you won't be able to see it with the naked eye. And often copy-pasting will also "fix" the issue.
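To actually see those hidden bytes, one can dump the suspect line outside an editor. A sketch; the line number is hypothetical:

grep -n "3602 MATLOCK" additional_contacts.jsonl      # find candidate line numbers
sed -n '12345p' additional_contacts.jsonl | od -c     # inspect one line byte by byte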
Suraj
03:37 PM
oh ok.
03:37
Suraj
03:37 PM
Got it.
Kishore Nallan
03:38 PM
So the trick is to take the original file, delete the lines before and after the suspected problematic record (keeping a buffer of a few records on either side), save that file, and try with that.
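Doing that slice with sed keeps editors from silently re-encoding the bytes. A sketch, assuming the suspect record sits around line 12345:

# Extract the suspect record plus a few lines of buffer, then import just the slice.
sed -n '12340,12350p' additional_contacts.jsonl > suspect_slice.jsonl
curl "${TYPESENSE_HOST}/collections/AdditionalContacts/documents/import" \
  -X POST -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -T suspect_slice.jsonl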
Suraj
03:38 PM
Cool, will try that approach. Thanks again, Kishore. You have been very helpful.

Mar 19, 2023 (9 months ago)
Kishore Nallan
03:09 PM
Hi Suraj, any luck in narrowing this down?
Mar 21, 2023 (9 months ago)
Suraj
08:03 AM
Hi Kishore Nallan, looks like it was a data quality issue. I had the team extract the data again, and this time I was able to convert the CSV to JSONL using mlr and load 28 lakh (2.8M) records in 3 batches: 10 lakh, 5 lakh, and 13 lakh. All uploads were smooth and fast too. I did not even have to use the parallel load option.

Thank you again for all your help. This is the version of Typesense I installed on the Ubuntu server. I have yet to try loading the fresh data into the Docker version, but my current setup is good enough for me to continue more R&D before we plan to use it with a customer instance.
Kishore Nallan
08:04 AM
Good to hear, thanks for the update.
