#community-help

Typesense Server Bulk Import/Upsert Issue Resolved

TLDR Adam was confused about the discrepancy between the successful responses and the actual indexed data while working with a custom WP plugin integrating with Typesense. The issue was a bug related to fetching documents in the wrong order, not a Typesense problem.

Apr 26, 2023 (7 months ago)
Adam
06:49 PM
hi everyone - I have a question about how quickly a self-hosted typesense server can process bulk import/upsert. details in the thread
Adam
07:02 PM
I’m working on a custom WP plugin to integrate with typesense. it’s coming along fairly well, but I’m getting a strange result right now that I don’t understand. When I make a POST to index (say) posts from the database, the following steps happen
• a batch of ~1000 post IDs are identified (there are ~20k in my test db)
• they’re chunked with array_chunk into groups of 10
• I iterate over those chunks: querying data from the db, preparing the data to send, and POSTing the data to the typesense server
• each time the data are sent, I count how many successes and failures there are and report those back as part of REST response
here’s the strange part. in the browser network tab, I can see that the network response says the request of 1000 batched posts (10x100) went through. But, when I query the typesense server directly, it says that only 216 documents are in my collection.

So - is it possible that somehow there’s a race condition between how fast typesense can process the data I’m sending and how fast those data are returned by my plugin? Should I sleep part of the process to allow the typesense server to catch up?

FWIW - I’m batching things this way because otherwise my plugin kept hitting memory limits. Thanks for any help you can provide!
Adam
07:25 PM
update - I tried adding sleep(5) after each run of my loop and got the same result. no additional posts indexed
Apr 27, 2023 (7 months ago)
Kishore Nallan
12:41 AM
Are you using the import end-point? Import response will contain the status of each imported record (in the same order). You can inspect that to see if a particular record is dropped for some reason, e.g. schema mismatch.
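A minimal sketch of that inspection, assuming the import response is one JSON object per line, in the same order as the documents sent, with a `success` flag and an `error` message on failures. The helper name `failed_imports` and the sample data are hypothetical.

```python
import json

def failed_imports(response_text, documents):
    """Pair each response line with the document sent at the same position."""
    results = [json.loads(line) for line in response_text.splitlines() if line.strip()]
    if len(results) != len(documents):
        raise ValueError(f"sent {len(documents)} documents, got {len(results)} result lines")
    # Keep only the documents whose result line does not report success.
    return [
        (doc, result.get("error", "unknown error"))
        for doc, result in zip(documents, results)
        if not result.get("success")
    ]

# Hypothetical example data; a real response would come from the import endpoint.
docs = [{"id": "1", "post_title": "a"}, {"id": "2", "post_title": 5}]
response = '{"success": true}\n{"success": false, "error": "Field `post_title` must be a string."}'
for doc, error in failed_imports(response, docs):
    print(doc["id"], error)
```

Counting the entries returned here, rather than trusting the HTTP status alone, is one way to see whether individual records were dropped.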
Adam
11:28 AM
hi Kishore Nallan thanks for asking. yes - I’m using /collections/{collection-name}/documents/import?action=upsert&return_id=true/ I’ve also tried action=create. all the responses claim to return successfully. there’s just this mismatch between the network response and the actually indexed items. I was wondering if it might be related to this issue, but I’m not sure how to interpret the /metrics.json response
Kishore Nallan
11:29 AM
Make sure you have set a large enough connection timeout value in your client. Default might be too low for large imports
Adam
11:30 AM
the default for my client is set to 600
Adam
11:34 AM
I can try increasing it later today to see if that helps. I haven’t quite started work yet, but wanted to check here. is there a way to stream a log of exactly what the typesense server is indexing? I notice in my docker logs that there were events like this as the items were being indexed
last_index index: 2371, committed_index: 2371, known_applied_index: 2371, applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 21719

but I wasn’t sure how to interpret them
Kishore Nallan
11:41 AM
600 seconds?
Kishore Nallan
11:41 AM
That log is fine. If there is any lag, queued_writes will be non-zero.

Have you tried printing the output of the import response?
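For the status log line quoted earlier in the thread, a small sketch of how the `queued_writes` check could be automated. It assumes the line keeps the `name: number` layout shown in the thread; the helper name `parse_status_line` is hypothetical.

```python
import re

def parse_status_line(line):
    """Extract the `name: number` pairs from a status log line into a dict."""
    return {key: int(value) for key, value in re.findall(r"(\w+): (\d+)", line)}

line = ("last_index index: 2371, committed_index: 2371, known_applied_index: 2371, "
        "applying_index: 0, queued_writes: 0, pending_queue_size: 0, local_sequence: 21719")
status = parse_status_line(line)
# A non-zero queued_writes would indicate the server is lagging behind incoming writes.
print("lagging" if status["queued_writes"] > 0 else "no queued writes")
```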
Adam
01:05 PM
sorry. yes, 600 seconds. as for the output of the response, I’m passing it to the response result of the request so it shows up in the client’s network response tab. but I’ve also been using vscode’s debugger. both show the same result. I have a meeting in 10 minutes, but then I can run the request again from scratch and screenshot what I’m seeing
Kishore Nallan
01:07 PM
Yeah please show the raw output of the import response. You can also try importing docs via the curl client to have a secondary source of reference.
Adam
02:20 PM
ok. here’s what I have so far. here’s a copy of the GET request to look at the localhost-posts collection before indexing documents
{
    "facet_counts": [],
    "found": 0,
    "hits": [],
    "out_of": 0,
    "page": 1,
    "request_params": {
        "collection_name": "localhost-posts",
        "per_page": 10,
        "q": "post"
    },
    "search_cutoff": false,
    "search_time_ms": 0
}

then I’ve got two screenshots. one is the network response from the client once the POST request resolves. it’s showing the JSON response from the typesense server. the second screenshot though shows that only 216 documents got indexed
Kishore Nallan
02:22 PM
That does not seem to be a response from the Typesense import API. We don't send data: etc.
Adam
02:24 PM
ah no the response in the first screenshot is what’s coming back from wordpress after it sends data to typesense
Adam
02:25 PM
this is the whole url it’s sending data to "http://typesense:8108/collections/localhost-posts/documents/import?action=upsert&return_id=true"
Adam
02:26 PM
I’m going to see if I can create a CSV or some other format of this data and send it through postman or something. clearly something’s not adding up
Kishore Nallan
02:28 PM
Create a file where each line is a JSON document and send it directly to the import end-point of Typesense. What is happening is that the client that is wrapping the Typesense client is not correctly sending the errors back.
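A minimal sketch of producing that one-document-per-line payload, assuming the documents are already plain dictionaries. The helper name `to_jsonl` and the sample documents are hypothetical.

```python
import json

def to_jsonl(documents):
    """Serialize documents as JSON Lines: one JSON object per line."""
    # json.dumps escapes newlines inside string values, so each document
    # stays on a single line even when the post content spans several lines.
    return "\n".join(json.dumps(doc) for doc in documents)

docs = [
    {"id": "1", "post_title": "Hello", "post_content": "line one\nline two"},
    {"id": "2", "post_title": "World", "post_content": "{braces} are fine inside strings"},
]
payload = to_jsonl(docs)
print(payload)
```

The resulting text can be saved to a file and posted straight to the import end-point, bypassing the plugin.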

Adam
02:30 PM
ok it’ll take a little time to set up, but then I’ll check back in when I’ve got a new result. thanks so much for taking time to help me debug this issue

Adam
03:13 PM
well I’m not sure what to make of this yet, but some of the data that got written to my test file are corrupt. when I post the file with curl, it reveals that there are some parts of the file being upserted correctly. but in other parts, there are unexpected closing curly brackets. they’re messing up both the length of the file and causing a bunch of entries to be skipped. so I think this might be more of a jsonl encoding problem than a typesense problem
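A sketch of one way to locate lines like these before sending the file, assuming each non-empty line should parse on its own as a JSON object. The helper name `invalid_jsonl_lines` and the sample text are hypothetical.

```python
import json

def invalid_jsonl_lines(text):
    """Return (line_number, message) for each non-empty line that is not a JSON object."""
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, exc.msg))
            continue
        if not isinstance(value, dict):
            problems.append((number, "not a JSON object"))
    return problems

# Hypothetical sample with a stray closing bracket on lines 2 and 3.
sample = '{"id": "1"}\n{"id": "2"}}\n}\n{"id": "3"}'
for number, message in invalid_jsonl_lines(sample):
    print(f"line {number}: {message}")
```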
Adam
03:43 PM
ok I fixed the formatting issue. this is the curl command I sent directly to typesense and its result. but, I still get the following back when I query the collection at http://localhost:8108/collections/localhost-posts
{
    "created_at": 1682609530,
    "default_sorting_field": "",
    "enable_nested_fields": true,
    "fields": [
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": ".*",
            "optional": true,
            "sort": false,
            "type": "auto"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "featured_image_url",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "permalink",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_author",
            "optional": true,
            "sort": false,
            "type": "object"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_content",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_date",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_excerpt",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_id",
            "optional": true,
            "sort": true,
            "type": "int64"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_sortby_date",
            "optional": true,
            "sort": true,
            "type": "int64"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_title",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_type",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_author.user_name",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_author.link",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_author.last_name",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_author.full_name",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_author.image_url",
            "optional": true,
            "sort": false,
            "type": "string"
        },
        {
            "facet": false,
            "index": true,
            "infix": false,
            "locale": "",
            "name": "post_author.first_name",
            "optional": true,
            "sort": false,
            "type": "string"
        }
    ],
    "name": "localhost-posts",
    "num_documents": 216,
    "symbols_to_index": [],
    "token_separators": []
}
Apr 28, 2023 (7 months ago)
Adam
05:22 PM
after working on this: typesense is working as expected, and was the whole time. there was a bug that fetched documents in the wrong order, so many were being duplicated. hence the discrepancy between the apparently successful responses and the items actually indexed
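A sketch of a check that would surface this kind of bug. With `action=upsert`, a repeated `id` reports success but replaces the existing document, so the success count can exceed the number of documents in the collection. The helper name `summarize_ids` and the sample batches are hypothetical.

```python
from collections import Counter

def summarize_ids(batches):
    """Compare how many documents were sent with how many distinct IDs they carry."""
    counts = Counter(doc["id"] for batch in batches for doc in batch)
    sent = sum(counts.values())
    duplicates = {doc_id: n for doc_id, n in counts.items() if n > 1}
    return sent, len(counts), duplicates

# Hypothetical batches: the second repeats IDs from the first,
# as a fetch that pages through the database in the wrong order might.
batches = [
    [{"id": "1"}, {"id": "2"}, {"id": "3"}],
    [{"id": "2"}, {"id": "3"}, {"id": "4"}],
]
sent, unique, duplicates = summarize_ids(batches)
print(f"sent {sent}, unique {unique}, repeated {sorted(duplicates)}")
```

If `unique` is lower than `sent`, the collection's `num_documents` should be compared against `unique`, not against the number of successful responses.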
