# community-help
g
When importing documents into a collection, after about 2500 successful imports, I started to get an error saying `The server had an error while processing your request. Sorry about that!`. I suspect the error has some relation to the fact that the collection has a built-in embedding field. Cluster:
v601y2x3upjea4tip
j
Hmm, that doesn’t seem like an error from Typesense…
Are you using a remote embedding service?
g
I'm importing in batches of 100. Connection timeout is set to 3 minutes. Retry interval is set to 5 seconds.
Yes, OpenAI.
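The batch size, timeout, and retry settings described above map onto typesense-js client options roughly like this (a sketch; the host is a placeholder, and the option names come from the `typesense` npm package's client configuration):

```javascript
// Configuration matching the settings described above: a 3-minute
// connection timeout and a 5-second retry interval. This object would
// be passed to `new Typesense.Client(config)` from the `typesense`
// npm package; it's built as a plain object here so nothing needs to
// be installed to inspect it.
const config = {
  nodes: [{ host: 'xyz.a1.typesense.net', port: 443, protocol: 'https' }], // placeholder host
  apiKey: process.env.TYPESENSE_API_KEY || 'placeholder-key',
  connectionTimeoutSeconds: 180, // 3 minutes
  retryIntervalSeconds: 5,       // 5 seconds between retries
};

const BATCH_SIZE = 100; // documents per import call, as described above
```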
j
Could you share the full JSON message that Typesense returns?
I suspect it’s from OpenAI’s API that we just proxy through
g
```json
{
  "error": {
    "message": "The server had an error while processing your request. Sorry about that!",
    "type": "server_error",
    "param": null,
    "code": null
  }
}
```
Yeah, I think it's from OpenAI. Maybe exceeding the rate limit or something.
j
We should probably indicate where the error is originating from, in cases like this
g
And maybe also retry the request to OpenAI's API when it makes sense if it's not already retrying.
j
We didn’t add a retry built-in to Typesense for remote services, to prevent any potential (billing) surprises. So if you see an error message in the API response from Typesense, you want to retry the import on those docs
g
Makes sense
Although...
Importing with `action: upsert` doesn't work because of that error where Typesense sends an empty string to OpenAI. So I guess I'll have to retry with `action: create` and just ignore errors saying the document already exists.
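That retry-with-`create` idea can be sketched as a small helper that inspects a batch import response (the Typesense import API returns one result object per document, in order) and collects only the documents worth re-sending; the "already exists" matching below is a heuristic, not official error handling:

```javascript
// Pair each imported document with its per-document result from a
// Typesense batch import response and collect the ones worth retrying.
// Results look like { success: true } or { success: false, error: "..." },
// one entry per document, in the same order as the input batch.
// "already exists" failures are skipped: with action: create, those
// documents are effectively already done.
function collectFailedDocs(docs, results) {
  const toRetry = [];
  results.forEach((result, i) => {
    if (result.success) return;
    if (/already exists/i.test(result.error || '')) return;
    toRetry.push(docs[i]);
  });
  return toRetry;
}
```

The returned documents would then be re-sent with `action: 'create'` after a short delay, repeating until the list is empty.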
j
Yeah… For now 😞
We’re just about to start addressing all the reported bugs in the last week. Should have something for you to test next week, if no other surprises come up
🤞 1
g
I was trying to delete and recreate instead of upserting, but it puts too much pressure on the server and starts constantly giving me errors when dealing with thousands of documents.
j
Deleting one by one is not as performant as deleting in a batch by query
g
Can I delete using the IDs in a query? Like `id in [a, b, c, d, ...]`
In case it's not clear, I mean, I have the IDs of the documents I want to delete. So I'd need to make a query with those IDs.
j
Yup, `id:=[a, b, c, d, ...]`
g
Gonna try it
How many items can I send in one query to be safe?
j
Since the parameter is sent as a query parameter, it takes a max of 2K characters
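Given that ~2K-character cap, one way to split a large ID list into several delete-by-query calls is to pack IDs greedily until the `id:=[...]` filter string would exceed the budget (a sketch, assuming the limit and filter syntax described above):

```javascript
// Split a list of document IDs into chunks such that the resulting
// filter_by string `id:=[id1, id2, ...]` stays under a character
// budget (Typesense query parameters are capped at ~2K characters).
function chunkIdsForFilter(ids, maxChars = 2000) {
  const overhead = 'id:=[]'.length; // fixed characters around the ID list
  const chunks = [];
  let current = [];
  let length = overhead;
  for (const id of ids) {
    const cost = id.length + (current.length > 0 ? 2 : 0); // ", " separator
    if (current.length > 0 && length + cost > maxChars) {
      chunks.push(current);
      current = [id];
      length = overhead + id.length;
    } else {
      current.push(id);
      length += cost;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each chunk then becomes one delete call, e.g. `client.collections('docs').documents().delete({ filter_by: 'id:=[' + chunk.join(', ') + ']' })`.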
g
Ok, I'll try here and let you know
👍 1
Weirdly, sometimes I get the error `'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.` even doing the workaround of deleting the document and recreating.
I was able to import 9900 documents (very fast using "delete by query") and the next batch gave me a lot of consecutive errors with that message.
It will probably work if I just restart my script skipping the successful batches, but still I'm intrigued by the error. It should only happen when updating a document, not when recreating.
And the error happened in the whole batch: `0 documents imported successfully, 100 documents failed during import.`
The code:
g
I assume it's on creation.
j
I wonder if there’s some API response from OpenAI API that we’re not handling properly
Or we may be passing blank strings somehow
Could you give me a script like this that replicates the $.input error message: https://gist.github.com/jasonbosco/7c3432713216c378472f13e72246f46b
g
I'm afraid I won't be able to reproduce with a Bash script because the error only happens when dealing with thousands of documents, and I'm not very good at Bash.
Does a JS repro work for you?
j
Yeah JS works too
g
I'll try to reproduce, let's see
What I know is: when I had the first error, I didn't have the `delete` and the `retry` in this code. Other than that, the same code. https://typesense-community.slack.com/archives/C01P749MET0/p1686864054410659?thread_ts=1686862232.273089&cid=C01P749MET0
I'm trying to reproduce here without success.
Will try one more thing.
BTW, it would help a lot to identify the issue if the error from Typesense included the request that was sent to OpenAI's API.
So, here's what I found:
1. I couldn't reproduce the `server_error` from my first message.
2. I found that the `invalid_request_error` happening in the 100th batch is because Typesense is trying to generate the embedding from a field that's an array and is empty.
3. There's a single document like that, but the whole batch fails saying `0 documents imported successfully, 100 documents failed during import`.
Indeed, if Typesense's error included the request being sent to OpenAI, I'd probably immediately identify the invalid input being sent in that specific document.
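Until the offending input is surfaced in the error, a client-side pre-scan can find the problematic document before importing. A sketch, where `field` is whichever field the collection's embedding is derived from (the field names used in it are up to your schema):

```javascript
// Find documents whose embedding source field would produce an empty
// input for OpenAI: missing, a blank string, or an empty array (the
// empty-array case is the one identified in the findings above).
function findBlankEmbeddingInputs(docs, field) {
  return docs.filter((doc) => {
    const value = doc[field];
    if (value == null) return true;
    if (typeof value === 'string') return value.trim() === '';
    if (Array.isArray(value)) {
      return value.length === 0 || value.every((v) => String(v).trim() === '');
    }
    return false;
  });
}
```

Running this over each batch before the import call would have pinpointed the single bad document (the 75th) without reading through 100 per-document errors.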
j
Ah good idea
Added this to our todo list
👍 1
g
Or maybe not, because the problematic document was the 75th in the batch, so I wouldn't necessarily read all 100 requests/errors and notice that specific one. So one thing that could be improved would be to avoid failing the whole batch and report an error only for the actually problematic document.
j
Hmm, we shouldn’t be failing the whole batch on a single document failure already… Will look into this
👍 1
g
I guess it could be an unhandled error crashing the whole thing in your code.
It's probably sending an empty string or something like that to OpenAI's API, which is the cause of the error, so maybe just assign some sort of null embedding (`[0, 0, ...]`?) in that case instead of crashing.
I mean, assign a null embedding instead of making the request to OpenAI, so it doesn't crash.
But I'm not sure if failing silently is ideal. Just writing some ideas here without too much thought.
j
So it turns out that OpenAI’s API fails the whole API call, even if one of the strings in a batch embedding request has an issue
We make one batch embedding call to OpenAI’s API for all the documents in a Typesense import API call
g
Oh, got it. So there doesn't seem to be a lot you can do. Maybe simply document that behavior.
j
In the upcoming build, we’re going to filter out all blank strings before we send it to openai, so at least that error is avoided. But if there are any other errors, yeah we have to fail the full batch on our side
g
What will the embedding field look like in that case?
j
We’ll set it to `null`
👍 1