#community-help

Threading Problem During Multiple Collection Creation and Batch Insertion in Typesense

TLDR Johan creates multiple collections and batch-inserts documents into Typesense, after which searching one collection returns documents from other collections. Kishore Nallan helps troubleshoot the issue and identifies a potential race condition that can occur when connecting locally, which is fixed in the 0.24.0.rc20 build.

Solved
Jul 05, 2022 (16 months ago)
Johan
07:28 AM
Hi! We’re evaluating Typesense for one of our applications. I’ve spent the last day working on batch upserting documents and I have a strange issue. When the application starts we generate the schema for multiple collections and create them if needed. Directly after, if no documents have been created before, we batch ingest ~1000 documents across ~20 collections with different schemas. When I try to query one of the collections I get results from different collections. The number of results seems correct, but the documents returned are incorrect. I’m running 0.23.0 locally on macOS and using the Node client APIs.

Does anyone know if there’s a potential threading issue with creating multiple collections and batch inserting documents?
Kishore Nallan
07:30 AM
👋 When you say "documents returned are incorrect", do you mean that the documents are being returned from a different collection than the one being queried?
Johan
07:31 AM
Exactly. Here’s an example response:
{
  "facet_counts": [],
  "found": 3,
  "hits": [
    {
      "document": {
        "_createdAt": 1643040319,
        "_publishedAt": 1654612352,
        "_updatedAt": 1654503876,
        "id": "8e0c59b8-5d01-5cce-b024-8648da3399d3",
        "type": "list"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1654079543,
        "_publishedAt": 1654612352,
        "_updatedAt": 1654503857,
        "id": "a0b26cc5-376a-57c2-b0ad-a3bb57ee59cd",
        "type": "list"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1643113413,
        "_publishedAt": 1654612352,
        "_updatedAt": 1654076197,
        "id": "fd7cbcf8-3b52-5567-829f-c0349b33c930",
        "type": "page"
      },
      "highlights": [],
      "text_match": 100
    }
  ],
  "out_of": 3,
  "page": 1,
  "request_params": {
    "collection_name": "list",
    "per_page": 10,
    "q": "*"
  },
  "search_cutoff": false,
  "search_time_ms": 0
}
Johan
07:31 AM
I’ve added the type to the document which is the same as the collection name.
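Since each document carries a type equal to its collection name, a mixed response can be detected mechanically. The following is a minimal sketch, not part of the original thread: the interfaces only model the response fields used here, and findForeignHits is a hypothetical helper name.

```typescript
// Minimal shape of the parts of a search response used by this check.
interface SearchHit {
  document: { id: string; type?: string };
}

interface SearchResponse {
  hits: SearchHit[];
  request_params: { collection_name: string };
}

// Returns the hits whose `type` differs from the queried collection name.
// Assumption: every document has a `type` field equal to its collection name.
function findForeignHits(response: SearchResponse): SearchHit[] {
  const expected = response.request_params.collection_name;
  return response.hits.filter((hit) => hit.document.type !== expected);
}

// Trimmed copy of the response above: two hits from a query on "list".
const sample: SearchResponse = {
  hits: [
    { document: { id: "8e0c59b8-5d01-5cce-b024-8648da3399d3", type: "list" } },
    { document: { id: "fd7cbcf8-3b52-5567-829f-c0349b33c930", type: "page" } },
  ],
  request_params: { collection_name: "list" },
};

// Logs the ids of hits that do not belong to the queried collection.
console.log(findForeignHits(sample).map((hit) => hit.document.id));
```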
Kishore Nallan
07:32 AM
Do the page and list collections have the same schema?
Johan
07:35 AM
Here’s the schema for list:
{
    "created_at": 1657005437,
    "default_sorting_field": "_updatedAt",
    "fields": [
      {
        "facet": false,
        "index": true,
        "infix": false,
        "locale": "",
        "name": "_createdAt",
        "optional": false,
        "sort": true,
        "type": "int64"
      },
      {
        "facet": false,
        "index": true,
        "infix": false,
        "locale": "",
        "name": "_updatedAt",
        "optional": false,
        "sort": true,
        "type": "int64"
      },
      {
        "facet": false,
        "index": true,
        "infix": false,
        "locale": "",
        "name": "_publishedAt",
        "optional": true,
        "sort": true,
        "type": "int64"
      },
      {
        "facet": true,
        "index": true,
        "infix": false,
        "locale": "",
        "name": "dataset",
        "optional": false,
        "sort": false,
        "type": "string"
      }
    ],
    "name": "list",
    "num_documents": 3,
    "symbols_to_index": [],
    "token_separators": []
  }
Johan
07:35 AM
And this is for page:
Johan
07:35 AM
{
    "created_at": 1657005437,
    "default_sorting_field": "_updatedAt",
    "fields": [
      {
        "facet": false,
        "index": true,
        "infix": false,
        "locale": "",
        "name": "webContent.metaTitle",
        "optional": true,
        "sort": false,
        "type": "string"
      },
      {
        "facet": false,
        "index": true,
        "infix": false,
        "locale": "",
        "name": "_createdAt",
        "optional": false,
        "sort": true,
        "type": "int64"
      },
      {
        "facet": false,
        "index": true,
        "infix": false,
        "locale": "",
        "name": "_updatedAt",
        "optional": false,
        "sort": true,
        "type": "int64"
      },
      {
        "facet": false,
        "index": true,
        "infix": false,
        "locale": "",
        "name": "_publishedAt",
        "optional": true,
        "sort": true,
        "type": "int64"
      },
      {
        "facet": true,
        "index": true,
        "infix": false,
        "locale": "",
        "name": "dataset",
        "optional": false,
        "sort": false,
        "type": "string"
      }
    ],
    "name": "page",
    "num_documents": 23,
    "symbols_to_index": [],
    "token_separators": []
  }
Johan
07:36 AM
Page has one more field.
Kishore Nallan
07:36 AM
To figure out what's going wrong, I would make a field that's non-optional and unique to list collection and see what happens when you index.
Kishore Nallan
07:37 AM
If a page document is accidentally being indexed into the list collection, then an error will be thrown.
Johan
07:37 AM
Ok. Give me a minute and I’ll try.
Johan
07:41 AM
It did not throw an error the first time I ran the migration, but the second time it gave me this: RequestMalformed: Request failed with HTTP code 400 | Server said: Field page is not part of collection schema.
Kishore Nallan
07:41 AM
You mean the schema migration?
Johan
07:41 AM
I added a field with the same name as the collection and set its value to the collection name at ingestion time.
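The marker-field approach described here can be sketched roughly as follows. This is an illustration under assumptions, not the code used in the thread: withMarkerField, tagDocument and missingRequiredFields are hypothetical helper names, and the last one is only a local stand-in that shows why a document routed to the wrong collection would fail a required-field check.

```typescript
interface Field {
  name: string;
  type: string;
  optional?: boolean;
  facet?: boolean;
}

interface CollectionSchema {
  name: string;
  fields: Field[];
  default_sorting_field?: string;
}

// Adds a required string field named after the collection itself.
function withMarkerField(schema: CollectionSchema): CollectionSchema {
  const marker: Field = { name: schema.name, type: "string", optional: false };
  return { ...schema, fields: [...schema.fields, marker] };
}

// Sets the marker field on a document at ingestion time.
function tagDocument(
  collectionName: string,
  doc: Record<string, unknown>
): Record<string, unknown> {
  return { ...doc, [collectionName]: collectionName };
}

// Local illustration only: lists required fields that a document lacks.
function missingRequiredFields(
  schema: CollectionSchema,
  doc: Record<string, unknown>
): string[] {
  return schema.fields
    .filter((field) => field.optional !== true && !(field.name in doc))
    .map((field) => field.name);
}

const listSchema = withMarkerField({
  name: "list",
  default_sorting_field: "_updatedAt",
  fields: [
    { name: "_createdAt", type: "int64", optional: false },
    { name: "_updatedAt", type: "int64", optional: false },
  ],
});

// A document tagged for "page" lacks the "list" marker field.
const pageDoc = tagDocument("page", { _createdAt: 1643113413, _updatedAt: 1654076197 });
console.log(missingRequiredFields(listSchema, pageDoc));
```

With a required marker on every collection, a page document sent to list is missing the list field, which is the mismatch Kishore expected to surface as an indexing error.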
Johan
07:42 AM
Sorry the second time I made a batch upsert.
Kishore Nallan
07:44 AM
The "Field X is not part of collection schema." error message is returned as part of a schema change 🤔
Johan
07:44 AM
Sorry my bad. Disregard the above. Let me try again. My schema migration was the cause of that error.
Kishore Nallan
07:45 AM
👍
Johan
07:47 AM
By adding a unique field to the collection it now seems to work. Is schema uniqueness a requirement?
Kishore Nallan
07:48 AM
No, it's not... What I am trying to figure out is what was happening earlier. If indeed the wrong document was being sent to the wrong collection, now with a unique field it should be throwing an error since we have not changed anything else apart from a unique constraint.
Kishore Nallan
07:51 AM
There are only 2 explanations for the earlier behavior:

a) Either a client side error where code erroneously sent the wrong document type to the collection.

b) Some race condition inside Typesense that sent the document to the wrong collection.

In both cases, if schema mismatch happens, an error should be thrown. So I'm surprised to see it getting indexed fine now.
Johan
07:51 AM
I’ve tried running the upsert script a couple of times now and the responses seem correct…
Kishore Nallan
07:52 AM
Is it possible for you to extract the behavior (without unique field) into a standalone script that I can run to reproduce the issue?
Johan
07:52 AM
Sure I can try.
Johan
08:40 AM
I’ve managed to recreate the issue in isolation.
Johan
08:40 AM
If you run yarn run-test multiple times it will start to mix collections in response:
Johan
08:41 AM
curl -H "X-TYPESENSE-API-KEY: xyz" "http://localhost:8108/collections/product/documents/search?q=*&query_by=dataset" | jq
Johan
08:41 AM
{
  "facet_counts": [],
  "found": 23,
  "hits": [
    {
      "document": {
        "_createdAt": 1622619587,
        "dataset": "global",
        "id": "global:8fd389b3-d63a-5a63-b616-7b3320293100",
        "type": "productEntryCategory"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1654261822,
        "dataset": "global",
        "id": "global:6504ace7-7273-5590-8ee8-b263a302d365",
        "type": "product"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1654159883,
        "dataset": "global",
        "id": "global:403d678a-0b39-5caa-8aa2-e583b3737cdb",
        "type": "product"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1622619587,
        "dataset": "global",
        "id": "global:3b385fef-0611-5178-b300-006889e071bc",
        "type": "productEntryCategory"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1622619587,
        "dataset": "global",
        "id": "global:45fdca58-958c-578a-aafa-a09e110b0af4",
        "type": "productEntryCategory"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1654088646,
        "dataset": "global",
        "id": "global:2fc86262-f5b8-5fa7-8010-240f95dae313",
        "type": "product"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1622619587,
        "dataset": "global",
        "id": "global:c7471607-589b-5d20-90e6-92011d1eb194",
        "type": "productEntryCategory"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1628508407,
        "dataset": "dataset1",
        "id": "dataset1:f82356a0-ca90-5efa-9f4c-6b58d9e35a3f",
        "type": "author"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1622620250,
        "dataset": "global",
        "id": "global:72fa23bf-a5d1-5028-ba2e-801ce8841219",
        "type": "productEntryCategory"
      },
      "highlights": [],
      "text_match": 100
    },
    {
      "document": {
        "_createdAt": 1653301312,
        "dataset": "global",
        "id": "global:c44292ef-c1ae-58a9-8df7-af01956f6149",
        "type": "product"
      },
      "highlights": [],
      "text_match": 100
    }
  ],
  "out_of": 23,
  "page": 1,
  "request_params": {
    "collection_name": "product",
    "per_page": 10,
    "q": "*"
  },
  "search_cutoff": false,
  "search_time_ms": 0
}
Johan
08:44 AM
It only seems to happen when not waiting for the response from the server (e.g. collectionNames.forEach(async (name) => {}) instead of for (const name of collectionNames)). The forEach call starts a promise for every collection without waiting for the earlier ones to finish, whereas the for...of loop, when its body awaits each call, completes one request before starting the next.
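The difference between the two loop styles can be shown without a server. This is a minimal sketch: fakeImport is a hypothetical stand-in for a network call such as a batch import, and the fixed delays exist only to make the ordering visible.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for a request; records when it starts and ends.
async function fakeImport(name: string, log: string[]): Promise<void> {
  log.push(`start ${name}`);
  await sleep(10);
  log.push(`end ${name}`);
}

// forEach ignores the promises returned by its callback, so all calls overlap.
async function runConcurrently(names: string[]): Promise<string[]> {
  const log: string[] = [];
  names.forEach(async (name) => {
    await fakeImport(name, log);
  });
  await sleep(50); // demo only: give the un-awaited calls time to finish
  return log;
}

// for...of with await finishes one call before starting the next.
async function runSequentially(names: string[]): Promise<string[]> {
  const log: string[] = [];
  for (const name of names) {
    await fakeImport(name, log);
  }
  return log;
}

// Concurrent order: start a, start b, end a, end b
runConcurrently(["a", "b"]).then((log) => console.log(log));
// Sequential order: start a, end a, start b, end b
runSequentially(["a", "b"]).then((log) => console.log(log));
```

In the forEach variant every request is in flight at the same time, which matches the condition under which the mixed results appeared.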
Kishore Nallan
09:21 AM
Thanks, we will look into this and keep you posted.
Kishore Nallan
10:06 AM
Johan I think this might be because of the client object being shared across all the async functions. Can you try instantiating the client object inside the async function?
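One way to try this suggestion is to create the client inside each async task instead of sharing one. The sketch below is a pattern illustration only: ImportClient, ClientFactory and importAll are hypothetical names that do not come from the Typesense client, and the thread does not confirm that this change resolved the issue.

```typescript
// Hypothetical minimal client interface; a real client would wrap the
// Typesense Node client's import call.
interface ImportClient {
  importDocuments(collection: string, docs: object[]): Promise<void>;
}

type ClientFactory = () => ImportClient;

interface Batch {
  name: string;
  docs: object[];
}

// Runs all imports concurrently, but gives each task its own client
// and waits for every task to finish before returning.
async function importAll(
  createClient: ClientFactory,
  batches: Batch[]
): Promise<void> {
  const tasks = batches.map(async (batch) => {
    const client = createClient(); // fresh client per task, not shared
    await client.importDocuments(batch.name, batch.docs);
  });
  await Promise.all(tasks);
}
```

The use of Promise.all is an addition of this sketch: it keeps the requests concurrent, as in the forEach version, while still letting the caller wait for completion and observe errors.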
Jul 19, 2022 (15 months ago)
Kishore Nallan
03:12 AM
Johan Were you able to figure this out?
Kishore Nallan
11:18 AM
I identified a potential race condition that could happen locally (but would be super rare when you connect to Typesense on another host), which I've fixed in the 0.24.0.rc20 build.