#community-help

Syncing records issue from BigQuery to Typesense using Airbyte

TLDR Jamshid had a problem syncing records from BigQuery to Typesense via Airbyte: only some of the records were syncing. Jason suggested checking the Airbyte logs for Typesense API responses. They traced the issue to how Airbyte's BigQuery connector handles repeated fields, and Jason suggested considering a custom sync script as a workaround.


Sep 29, 2023 (2 months ago)
Jamshid
09:59 PM
We are syncing records from a BigQuery table to Typesense through Airbyte. There are 50551 records in BigQuery, and Airbyte says "recordsSynced" : 50551, but when we look at Typesense Cloud, only 11,032 records are synced.
Jason
10:00 PM
The Typesense API would have returned any errors in the API response... maybe the Airbyte logs have the actual response from Typesense?
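
For context on Jason's point: the Typesense bulk import endpoint replies with one JSON result per line, so individual documents can fail even when the request as a whole succeeds. A minimal sketch of tallying those per-document results follows. The function name `summarize_import_results` and the sample response are made up for illustration, and whether Airbyte records this response in its logs is not confirmed in this thread.

```python
import json


def summarize_import_results(response_body: str) -> dict:
    """Count successes and collect error messages from a JSONL import response."""
    succeeded = 0
    errors = []
    for line in response_body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between results
        result = json.loads(line)
        if result.get("success"):
            succeeded += 1
        else:
            errors.append(result.get("error", "unknown error"))
    return {"succeeded": succeeded, "failed": len(errors), "errors": errors}


# Illustrative response only; real error text comes from Typesense.
sample = "\n".join([
    '{"success": true}',
    '{"success": false, "error": "example error message"}',
])
summary = summarize_import_results(sample)
print(summary)
```

If `failed` is non-zero, the `errors` list is where the reason for the missing records would show up.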
Jamshid
10:03 PM
Thanks for the prompt response. I checked and you are right; it's probably these:

2023-09-29 21:56:16 replication-orchestrator > Schema validation errors found for stream xxxx. Error messages: [$.has_profile: null found, string expected, $.have_dependent_children: null found, string expected, $.country_of_residence: null found, string expected, $.meta_noc: null found, string expected, ........]

There are a bunch more messages exactly like the one above.

Now I wonder why that is. Why does it expect a string instead of accepting null?

Any immediate thoughts?
Jason
10:08 PM
If fields can be null, they'd need to be defined as optional: true in the collection schema
Jason
10:09 PM
That particular error message though is not from Typesense... unless the connector is reading the typesense error and is rewording it
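
A minimal sketch of the `optional: true` setting Jason mentions, using the `typesense` Python client. The collection name `people_example`, the helper names, and the connection settings are placeholders, not values from this thread; the field names are taken from the error message above.

```python
def build_schema(collection_name, nullable_fields):
    """Build a collection schema in which every listed field may be missing or null."""
    return {
        "name": collection_name,
        "fields": [
            {"name": field, "type": "string", "optional": True}
            for field in nullable_fields
        ],
    }


def create_collection(schema):
    """Create the collection on a Typesense server (placeholder connection settings)."""
    import typesense  # third-party client: pip install typesense

    client = typesense.Client({
        "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
        "api_key": "xyz",  # placeholder key
        "connection_timeout_seconds": 2,
    })
    return client.collections.create(schema)


schema = build_schema("people_example", ["has_profile", "country_of_residence"])
print(schema)
```

Calling `create_collection(schema)` needs a reachable Typesense server, so it is left out of the example run.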

Jamshid
10:09 PM
This is what I have:

{
  "facet": false,
  "index": true,
  "infix": false,
  "locale": "",
  "name": "has_profile",
  "optional": true,
  "sort": false,
  "type": "string"
},

Jason
10:09 PM
Maybe there's another schema validation happening in the connector
Jamshid
10:10 PM
It is of type string in BigQuery and Airbyte is detecting it as string too. Typesense shows it as string as well. Strange.
Jamshid
10:11 PM
Actually, this happened after we changed 3 of our fields from the string to the record data type in BigQuery. Just sharing in case it somehow helps.
Jason
10:12 PM
I see, I wonder if the Airbyte connector stores state about the record types and expects them to be the same
Jason
10:12 PM
Maybe try creating a new collection in Typesense and a new Airbyte connector instance, and then try syncing the full dataset again?
Jamshid
10:13 PM
Sure, can do that. Appreciate the help. Will write back here.
Jamshid
10:20 PM
Okay, before I do that, I excluded all 3 of the record-type fields above and now all 50K records are synced. It is definitely something about them.
Jamshid
10:30 PM
I still see the same validation errors I pasted above, but all the records came through. So those errors don't look like the source of the issue. Any guesses on whether the issue is with Airbyte or Typesense?
Oct 02, 2023 (2 months ago)
Jason
01:00 AM
That particular phrasing of the error is not from Typesense... So I would start by looking into any additional detailed logs in Airbyte
Jamshid
05:56 PM
Hi Jason, there are no errors in the Airbyte logs. We didn't have any issues two weeks ago. Were there any changes on the Cloud that may have caused this? We haven't seen any changes on the Airbyte side.
Jason
07:48 PM
> Were there any changes on the Cloud that may caused this?
We do not change Typesense versions automatically, except in very rare cases to restore cluster stability.
Jamshid
09:58 PM
Jason, I just tried that with a new connection and a different collection. It is still the same. I also downgraded my Typesense version to 0.25.1 and then to 0.25.0, and then upgraded to the current RC versions; nothing changed.

Again, if I do not include the object fields, I get all the data on the typesense side.

Object fields show up like this on the Typesense side:

{
  "created_at": 1696283162,
  "default_sorting_field": "",
  "enable_nested_fields": true,
  "fields": [
    {
      "facet": false,
      "index": true,
      "infix": false,
      "locale": "",
      "name": "work",
      "optional": true,
      "sort": false,
      "type": "object[]"
    },
    {
      "facet": false,
      "index": true,
      "infix": false,
      "locale": "",
      "name": "",
      "optional": true,
      "sort": false,
      "type": "string[]"
    },
    {
      "facet": false,
      "index": true,
      "infix": false,
      "locale": "",
      "name": "work.job_title",
      "optional": true,
      "sort": false,
      "type": "string[]"
    },
  .....
  .....
}

which looks okay to me.
Jason
10:39 PM
I see ok...

Since that error message you shared is not from Typesense, it's hard to debug this further from our side. Maybe Airbyte can offer insights into what exactly that error is?
Oct 03, 2023 (2 months ago)
Jamshid
04:33 PM
Thanks, Jason. This feels like an issue with the BigQuery connector on Airbyte. I found two related issues:

1. https://github.com/airbytehq/airbyte/issues/30179
2. https://github.com/airbytehq/airbyte/issues/4487
It looks like the connector can't handle repeated fields very well. I can confirm that's the case for me.

To avoid that error, we can save the array of objects as a string. Doing that, though, we lose the ability to access those objects, as our "enable_nested_fields": true will be useless. I wonder if there is a way to change the data type from string to object[] after the sync to Typesense happens, or if there are any alternative solutions.
Jason
05:00 PM
> way to change the data type after the sync happens to typesense from string to object[]
This is not possible to do in Typesense...
Jason
05:02 PM
One thing you could do: once the data is synced into Typesense with a string datatype, create a new collection in Typesense directly via the API with the correct schema, then export the docs from the Airbyte-created collection and import them into the collection you created
Jason
05:02 PM
Oh, although that won't help with ongoing sync; it will just be a one-time thing
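
The one-time export and re-import Jason describes could be sketched roughly as below with the `typesense` Python client. This assumes the string field holds valid JSON text, which the thread does not confirm. The function names, the field name `work` used in the sample, and the sample document are illustrative only.

```python
import json


def convert_export(jsonl: str, field: str) -> list:
    """Parse exported JSONL and turn a JSON-encoded string field back into a list."""
    docs = []
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        doc = json.loads(line)
        raw = doc.get(field)
        if isinstance(raw, str):
            # Assumption: the string is JSON text; an empty string becomes an empty list.
            doc[field] = json.loads(raw) if raw.strip() else []
        docs.append(doc)
    return docs


def copy_collection(client, source: str, target: str, field: str):
    """Export from the Airbyte-created collection and import into the hand-made one."""
    exported = client.collections[source].documents.export()  # JSONL string
    docs = convert_export(exported, field)
    return client.collections[target].documents.import_(docs, {"action": "upsert"})


# Illustrative document, not real data from the thread.
sample = json.dumps({"id": "1", "work": json.dumps([{"job_title": "Engineer"}])})
converted = convert_export(sample, "work")
print(converted)
```

`copy_collection` expects an already configured `typesense.Client` and a target collection that was created beforehand with `work` typed as `object[]`.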
Jason
05:03 PM
Side note: have you considered building your own simple sync script that reads from BigQuery and ingests into Typesense directly via the API, considering these issues that you've run into?
Jamshid
05:09 PM
Thanks! These are all very helpful. In terms of building a sync script, we haven't thought about it yet. Have you seen that to be an easier process? We are a small team dealing with a variety of sources and tasks, and thought to delegate most of our work to the tools (of course, when they work).
Oct 04, 2023 (2 months ago)
Jason
04:13 PM
Building your own sync process is easier, if the alternatives don't work I guess!
Jason
04:13 PM
And you have complete control over the data transformations.
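
As a rough idea of what such a sync script might look like, here is a sketch using the `google-cloud-bigquery` and `typesense` Python clients. The helper names, the batch size, and the `SELECT *` query are assumptions rather than anything from this thread. A real script would also need to give each document a string `id` and convert values such as timestamps into JSON-friendly types.

```python
from itertools import islice


def drop_nulls(row: dict) -> dict:
    """Remove top-level null values so optional fields are simply absent."""
    return {key: value for key, value in row.items() if value is not None}


def batched(rows, size: int):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def sync(table: str, collection: str, ts_client, batch_size: int = 1000) -> int:
    """Copy every row of a BigQuery table into Typesense; return the failure count."""
    from google.cloud import bigquery  # third-party: pip install google-cloud-bigquery

    bq_client = bigquery.Client()
    rows = bq_client.query(f"SELECT * FROM `{table}`").result()
    docs = (drop_nulls(dict(row)) for row in rows)
    failed = 0
    for batch in batched(docs, batch_size):
        results = ts_client.collections[collection].documents.import_(
            batch, {"action": "upsert"}
        )
        # Each result is a dict with a "success" flag, one per document.
        failed += sum(1 for result in results if not result.get("success"))
    return failed


print(list(batched(range(5), 2)))
```

Repeated RECORD columns come back from the BigQuery client as lists of dicts, which is the shape an `object[]` field expects.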