#community-help

Errors in Batch Import with Typesense and OpenAI API

TLDR Gustavo encountered errors when importing documents into a collection. After discussion with Jason, it was concluded that the issue stemmed from OpenAI API's handling of batch requests with problematic documents, and improvements to Typesense's error messages and handling were suggested.

Solved
Jun 15, 2023 (3 months ago)
Gustavo
08:50 PM
When importing documents into a collection, after about 2500 successful imports, I started to get an error saying "The server had an error while processing your request. Sorry about that!". I suspect the error is related to the fact that the collection has a built-in embedding field. Cluster: v601y2x3upjea4tip
Jason
08:51 PM
Hmm, that doesn’t seem like an error from Typesense…
Jason
08:51 PM
Are you using a remote embedding service?
Gustavo
08:51 PM
I'm importing in batches of 100. Connection timeout is set to 3 minutes. Retry interval is set to 5 seconds.
Gustavo
08:51 PM
Yes, OpenAI.
Jason
08:52 PM
Could you share the full JSON message that Typesense returns?
Jason
08:52 PM
I suspect it’s from OpenAI’s API that we just proxy through
Gustavo
08:53 PM
"error": {
  "message": "The server had an error while processing your request. Sorry about that!",
  "type": "server_error",
  "param": null,
  "code": null
}
Gustavo
08:54 PM
Yeah, I think it's from OpenAI. Maybe exceeding the rate limit or something.
Jason
08:54 PM
We should probably indicate where the error is originating from, in cases like this
Gustavo
08:55 PM
And maybe also retry the request to OpenAI's API when it makes sense if it's not already retrying.
Jason
08:56 PM
We didn’t build retries for remote services into Typesense, to prevent any potential (billing) surprises. So if you see an error message in the API response from Typesense, you’ll want to retry the import for those docs.
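A minimal sketch of the client-side retry described here, assuming the import response has been parsed into one result object per document, in the same order as the submitted documents, each with a `success` flag. `import_batch` is a hypothetical callable standing in for whatever client call performs the import; it is not a Typesense API.

```python
import time
from typing import Callable


def failed_docs(docs: list[dict], results: list[dict]) -> list[dict]:
    """Return the documents whose per-document result reports a failure."""
    return [doc for doc, res in zip(docs, results) if not res.get("success", False)]


def import_with_retry(
    docs: list[dict],
    import_batch: Callable[[list[dict]], list[dict]],  # hypothetical import call
    max_attempts: int = 3,
    wait_seconds: float = 5.0,
) -> list[dict]:
    """Import docs, re-sending only the failed ones. Returns docs that still fail."""
    pending = docs
    for attempt in range(max_attempts):
        if not pending:
            break
        results = import_batch(pending)
        pending = failed_docs(pending, results)
        # Wait between attempts, but not after the last one.
        if pending and attempt < max_attempts - 1:
            time.sleep(wait_seconds)
    return pending
```

Capping `max_attempts` keeps the number of extra embedding requests, and therefore the cost, bounded.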
Gustavo
08:57 PM
Makes sense
Gustavo
08:57 PM
Although...
Gustavo
08:58 PM
Importing with action: upsert doesn't work because of that error where Typesense sends an empty string to OpenAI.
Gustavo
08:58 PM
So I guess I'll have to retry with action: create and just ignore errors saying the document already exists.
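This workaround could look roughly like the following. The sketch assumes each failed result carries a numeric `code` and that a duplicate ID is reported as 409; both assumptions should be checked against the actual import response. `real_failures` is a name made up for this example.

```python
ALREADY_EXISTS = 409  # assumed status code for a duplicate document ID


def real_failures(results: list[dict]) -> list[dict]:
    """Keep failed results, ignoring 'document already exists' conflicts."""
    return [
        res
        for res in results
        if not res.get("success", False) and res.get("code") != ALREADY_EXISTS
    ]
```

Anything this function returns is a failure worth retrying or inspecting; conflicts from documents imported on an earlier attempt are dropped.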
Jason
08:59 PM
Yeah… For now 😞
Jason
09:00 PM
We’re just about to start addressing all the reported bugs in the last week. Should have something for you to test next week, if no other surprises come up

Gustavo
09:00 PM
I was trying to delete and recreate instead of upserting, but it puts too much pressure on the server, which starts to give me errors constantly when dealing with thousands of documents.
Jason
09:01 PM
Deleting one by one is not as performant as deleting in a batch by query
Gustavo
09:02 PM
Can I delete using the IDs in a query? Like `id in [a, b, c, d, ...]`
Gustavo
09:03 PM
In case it's not clear, I mean, I have the IDs of the documents I want to delete. So I'd need to make a query with those IDs.
Jason
09:03 PM
Yup, `id:=[a, b, c, d, ...]`
Gustavo
09:03 PM
Gonna try it
Gustavo
09:03 PM
How many items can I send in one query to be safe?
Jason
09:04 PM
Since the parameter is sent as a query parameter, it takes a max of 2K characters
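To stay under that limit, the IDs can be split across several delete-by-query calls. A minimal sketch, assuming the `id:=[...]` filter syntax from the message above and treating 2,000 characters as the budget for the filter string alone. `id_filters` and `max_chars` are names made up for this example, and URL encoding may lengthen the string further.

```python
def id_filters(ids: list[str], max_chars: int = 2000) -> list[str]:
    """Pack IDs into as few `id:=[...]` filter strings as fit within max_chars."""
    prefix, suffix = "id:=[", "]"
    base = len(prefix) + len(suffix)
    filters: list[str] = []
    current: list[str] = []
    length = base
    for doc_id in ids:
        extra = len(doc_id) + (1 if current else 0)  # +1 for the separating comma
        if current and length + extra > max_chars:
            # Adding this ID would overflow: flush the current group first.
            filters.append(prefix + ",".join(current) + suffix)
            current, length = [], base
            extra = len(doc_id)
        # Note: a single ID longer than the budget still gets its own filter.
        current.append(doc_id)
        length += extra
    if current:
        filters.append(prefix + ",".join(current) + suffix)
    return filters
```

Each returned string would be passed as the filter of one delete-by-query request. IDs that contain commas or brackets would need escaping, which this sketch does not attempt.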
Gustavo
09:05 PM
Ok, I'll try it here and let you know

Gustavo
09:17 PM
Weirdly, sometimes I get the error "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference." even when using the workaround of deleting the document and recreating it.
Gustavo
09:18 PM
I was able to import 9900 documents (very fast using "delete by query") and the next batch gave me a lot of consecutive errors with that message.
Gustavo
09:19 PM
It will probably work if I just restart my script skipping the successful batches, but still I'm intrigued by the error. It should only happen when updating a document, not when recreating.
Gustavo
09:19 PM
And the error happened in the whole batch: 0 documents imported successfully, 100 documents failed during import.
Gustavo
09:20 PM
The code:
[Image 1: the code]
Gustavo
09:21 PM
I assume it's on creation.
Jason
09:21 PM
I wonder if there’s some API response from OpenAI API that we’re not handling properly
Jason
09:21 PM
Or we may be passing blank strings somehow
Jason
09:22 PM
Could you give me a script like this that replicates the $.input error message: https://gist.github.com/jasonbosco/7c3432713216c378472f13e72246f46b
Gustavo
09:24 PM
I'm afraid I won't be able to reproduce with a Bash script because the error only happens when dealing with thousands of documents, and I'm not very good at Bash.
Gustavo
09:25 PM
Does a JS repro work for you?
Jason
09:25 PM
Yeah JS works too
Gustavo
09:25 PM
I'll try to reproduce, let's see
Gustavo
09:52 PM
What I know is: when I had the first error, I didn't have the delete and the retry in this code. Other than that, the same code.
https://typesense-community.slack.com/archives/C01P749MET0/p1686864054410659?thread_ts=1686862232.273089&cid=C01P749MET0
Gustavo
09:52 PM
I'm trying to reproduce here without success.
Gustavo
09:53 PM
Will try one more thing.
Gustavo
09:57 PM
BTW, it would help a lot to identify the issue if the error from Typesense included the request that was sent to OpenAI's API.
Gustavo
10:29 PM
So, here's what I found:
Gustavo
10:33 PM
1. I couldn't reproduce the server_error from my first message.
2. I found that the invalid_request_error error happening in the 100th batch is because Typesense is trying to generate the embedding from a field that's an array and is empty.
3. There's a single document like that, but the whole batch fails saying 0 documents imported successfully, 100 documents failed during import.
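A pre-flight check along these lines could flag such a document before the import. It is a sketch only: `embed_from` stands for the list of source fields configured on the embedding field, and the rule used here (flag a document if any source field is missing, blank, or an empty array) is a deliberately conservative assumption about what produces an invalid input.

```python
def is_blank(value) -> bool:
    """True for None, whitespace-only strings, and arrays with no non-blank items."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple)):
        return all(is_blank(item) for item in value)  # also True for []
    return False


def docs_with_empty_embedding_input(docs: list[dict], embed_from: list[str]) -> list[int]:
    """Return the batch positions of documents with a blank embedding source field."""
    return [
        index
        for index, doc in enumerate(docs)
        if any(is_blank(doc.get(field)) for field in embed_from)
    ]
```

Returning positions rather than documents makes it easy to report something like "document 75 of this batch has an empty source field".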
Gustavo
10:34 PM
Indeed, if Typesense's error included the request being sent to OpenAI, I'd probably immediately identify the invalid input being sent in that specific document.
Jason
10:35 PM
Ah good idea
Jason
10:36 PM
Added this to our todo list

Gustavo
10:36 PM
Or maybe not, because the problematic document was the 75th in the batch, so I wouldn't necessarily read all 100 requests/errors and notice that specific one. So one thing that could be improved would be to avoid failing the whole batch and return an error only for the document that is actually problematic.
Jason
10:37 PM
Hmm, we shouldn’t be failing the whole batch on a single document failure already… Will look into this

Gustavo
10:37 PM
I guess it could be an unhandled error crashing the whole thing in your code.
Gustavo
10:40 PM
It's probably sending an empty string or something like that to OpenAI's API, which is the cause of the error, so maybe just assign some sort of null embedding ([0, 0, ...]?) in that case instead of crashing.
Gustavo
10:41 PM
I mean, assign a null embedding instead of making the request to OpenAI, so it doesn't crash.
Gustavo
10:42 PM
But I'm not sure if failing silently is ideal. Just writing some ideas here without too much thought.
Jason
11:38 PM
So it turns out that OpenAI’s API fails the whole API call, even if one of the strings in a batch embedding request has an issue
Jason
11:39 PM
We make one batch embedding call to OpenAI’s API for all the documents in a Typesense import API call
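Because one bad input fails the whole call, a client can narrow down the culprit by splitting a failing batch and re-importing the halves. A rough sketch, where `import_ok` is a hypothetical callable that returns True when a batch imports cleanly. Every extra attempt triggers another embedding request, so this trades cost for diagnosis.

```python
from typing import Callable


def isolate_bad_docs(
    docs: list[dict],
    import_ok: Callable[[list[dict]], bool],  # hypothetical: True if the batch imported
) -> list[dict]:
    """Recursively halve a failing batch until the failing documents are isolated."""
    if not docs or import_ok(docs):
        return []
    if len(docs) == 1:
        return docs  # a single document that fails on its own
    mid = len(docs) // 2
    return isolate_bad_docs(docs[:mid], import_ok) + isolate_bad_docs(docs[mid:], import_ok)
```

For a batch of 100 with one bad document, this takes about a dozen or so import attempts instead of 100 single-document ones, and the halves that succeed are imported along the way.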
Jun 16, 2023 (3 months ago)
Gustavo
12:00 AM
Oh, got it. So there doesn't seem to be a lot you can do. Maybe simply document that behavior.
Jason
12:01 AM
In the upcoming build, we’re going to filter out all blank strings before we send them to OpenAI, so at least that error is avoided. But if there are any other errors, yeah, we have to fail the full batch on our side.
Gustavo
12:04 AM
What will the embedding field look like in that case?
Jason
12:06 AM
We’ll set it to null
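The planned behaviour can be pictured as follows. This is an illustration of the idea only, not Typesense's code; `embed_batch` is a hypothetical stand-in for the remote embedding call, assumed to return one vector per input text, in order.

```python
from typing import Callable, Optional


def embed_skipping_blanks(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],  # hypothetical remote call
) -> list[Optional[list[float]]]:
    """Embed only non-blank texts; blank inputs get None (null) instead of a vector."""
    keep = [i for i, text in enumerate(texts) if text and text.strip()]
    # Skip the remote call entirely when nothing is left to embed.
    vectors = embed_batch([texts[i] for i in keep]) if keep else []
    out: list[Optional[list[float]]] = [None] * len(texts)
    for i, vector in zip(keep, vectors):
        out[i] = vector
    return out
```

Keeping the output aligned with the input positions means each document can still be matched to its embedding, or to its absence.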
