#community-help

Issue with Field Indexing and Multiple Data Types

TLDR Raymond encountered an issue where a field seemed to be indexed twice and hence couldn't be deleted. Jason advised upgrading to a patch version, but the problem remained. Kishore Nallan suspected a race condition and an issue with conflicting data types. An effective solution wasn't achieved.

Powered by Struct AI

Aug 10, 2022 (17 months ago)
Raymond
06:46 PM
Hey guys 👋 Is it possible to have a field indexed twice?

I can't seem to delete either of them. I've also attached the response for the delete operation.
Jason
06:53 PM
> Is it possible to have a field indexed twice?
No, this shouldn’t be possible. Unless there’s a bug. May I know what the left side vs right side content is in your diff?

> I’ve also attached the response for the delete operation
Hmm, I vaguely remember us fixing an issue related to auto fields and dropping in v0.23.1. Could you try this after upgrading to that patch version?
06:54
Jason
06:54 PM
re: downtime, since it’s a single node cluster, you’ll see a downtime of about 20 minutes for your dataset
06:54
Jason
06:54 PM
Ok to queue up the upgrade?

Raymond
06:54 PM
Sure, I'll try it again after the patch.
Jason
06:55 PM
In progress now… I’ll keep you posted

Raymond
06:55 PM
re:left vs right side

Both are existing fields in the schema; both represent date fields from GitHub, but some webhooks send them as string and some as int64. Ideally it should not create a new one, right?
Jason
06:56 PM
Yeah, it shouldn’t create a new one ideally and just error out when a different data type is encountered.
06:56
Jason
06:56 PM
I only see one field in your schema though
Raymond
06:57 PM
Yeah it only shows one on the UI, but two in the API response
Jason
06:58 PM
Oh hmmm! API response from GET /collections/name endpoint?
Raymond
06:59 PM
Yep
Jason
07:40 PM
Your upgrade is now complete

07:41
Jason
07:41 PM
Btw, I see two collections, events and events-2, and each of those has different data types for the same field, which is expected because field data types are specific to a particular collection
Raymond
07:43 PM
yeah, I created the second one to validate the field type when the first one wasn't working well
Jason
07:47 PM
Even via the API I only see one instance of data.repository.created_at. Could you share the curl command you’re using along with the output?
Raymond
07:47 PM
Yeah, it's only one now
Jason
07:48 PM
Hmmm, ok, let me know if you see it again. Maybe another fix in v0.23.1 addressed this issue as well
Raymond
07:50 PM
I still keep seeing this response
{
    "message": "Schema change is incompatible with the type of documents already stored in this collection. Existing data for field `data.repository.pushed_at` cannot be coerced into a string."
}
Jason
07:51 PM
Is this on the events or events-2 collection?
Raymond
07:52 PM
events
Jason
07:55 PM
I’m able to replicate the error message. Need to take a closer look… In the meantime, I’d recommend just creating a new collection to work with
Raymond
07:59 PM
Ok, thanks
Aug 11, 2022 (17 months ago)
Kishore Nallan
03:48 AM
I think there are two different issues.

1. Somehow data.repository.created_at has ended up with two different types, string and int64, in the schema.
2. There is also another issue of data.repository.pushed_at having a different type on-disk compared to the schema, which prevents schema modification.
I will look into why these happened.
04:13
Kishore Nallan
04:13 AM
Raymond, can you tell me more about how you index these documents? Do you run a single import job, or is there a possibility of multiple independent import/indexing jobs running in parallel?
Raymond
10:41 AM
I normally run a single job that reads off a MongoDB change stream, but last night I needed to debug and ran the change stream from my laptop. The hosted indexer is running Node.js code and the local (new) version is running Go 1.18.

That might have caused the issue with the date string to int64 conversion
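One way to avoid the two-pipeline mismatch is to normalize the timestamp before it ever reaches the indexer, so the auto schema detector only sees one type. A minimal sketch of that idea; the helper name and the choice of int64 epoch seconds are assumptions for illustration, not something from this thread:

```python
from datetime import datetime

def to_epoch_seconds(value):
    """Coerce a GitHub-style timestamp (ISO 8601 string or epoch int)
    into a single int64 representation before indexing."""
    if isinstance(value, int):
        return value
    # ISO strings like "2022-08-10T18:46:00Z" -> epoch seconds
    return int(datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp())
```

Running every webhook payload through a shim like this in both the Node.js and Go indexers would keep the field's type consistent regardless of which pipeline wrote the document.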
Kishore Nallan
10:42 AM
Do you use the import method?
Raymond
10:43 AM
> There is also another issue of data.repository.pushed_at having a different type on-disk compared to the schema which prevents schema modification.
For this case I sincerely don't know what to do about it because I already dropped the field.
Kishore Nallan
10:44 AM
But it still exists in the schema? Can you check?
Raymond
10:46 AM
It was deleted when I dropped it. Then I added a new record with a string, and it changed type to string.
10:46
Raymond
10:46 AM
But for some reason it still keeps coming up when I try to delete the other record
10:46
Raymond
10:46 AM
We can huddle if you have some time
Kishore Nallan
10:47 AM
Ah, I think I know what happened with pushed_at. Let me try to explain it.

10:50
Kishore Nallan
10:50 AM
First you have pushed_at as an int64 field, and when you drop it, the field is removed from the schema. When you then add the same field as a string, the auto schema detector detects it as string and the schema gets the field back. But the disk already contains documents with the int64 value, so at this point any schema update operation is going to fail, because the schema says string while some docs hold an integer value.
Raymond
10:52 AM
Yep, this is what I assumed too. Is it possible to find those affected records?
Kishore Nallan
10:57 AM
You'll have to export the docs and search for it... it's not possible to exhaustively find bad data, as there could be one or more records with string values.
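A minimal sketch of that export-and-scan approach. It assumes the docs were exported as JSONL (e.g. via the `GET /collections/events/documents/export` endpoint) and that the dotted field name is stored as a flat key, as v0.23-era auto-detected schemas do; the function name and the string-vs-int64 default follow the error message earlier in the thread and are illustrative only:

```python
import json

def find_type_mismatches(jsonl_lines, field, expected_type=str):
    """Return ids of exported docs whose `field` value does not match
    the type the schema now declares (e.g. int64 leftovers vs string)."""
    bad = []
    for line in jsonl_lines:
        doc = json.loads(line)
        if field in doc and not isinstance(doc[field], expected_type):
            bad.append(doc.get("id"))
    return bad
```

The offending ids can then be deleted or re-imported with a corrected value before retrying the schema change.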
Raymond
10:58 AM
Hmm ok, thanks
Kishore Nallan
10:59 AM
The original issue of the same field name having two types is probably a race condition. From what I can grok, the Go code seems to call the index method in a for loop? Does this happen one at a time or in parallel?
Raymond
10:59 AM
This seems like something that can potentially ruin a search dataset though: one bad record can mess up the whole dataset
Kishore Nallan
11:00 AM
Yes, this is the first time we have an interplay between auto schema detection and schema change, so I'll have to think about how to handle this.
11:00
Kishore Nallan
11:00 AM
Maybe during schema change we can give an option to ignore bad records. That will ensure we are not stuck.
Raymond
11:01 AM
The for loop is there so the code behaves like an event loop. The code actually only runs when a new record is inserted
Kishore Nallan
11:02 AM
Can multiple index operations occur in parallel?
Raymond
11:02 AM
Nope, one at a time
Kishore Nallan
11:22 AM
Ok let me look into this further.

12:56
Kishore Nallan
12:56 PM
Do you remember if you made any schema changes on the created_at field as well?
Raymond
12:57 PM
No I didn't, that one hasn't had any issues yet
Kishore Nallan
01:05 PM
I see. Is it easy for you to run your ingestion pipeline locally (maybe against a local Docker Typesense) to see if the same-field, two-types issue is easy to reproduce?
Aug 12, 2022 (17 months ago)
Raymond
08:12 AM
I can try to set this up. Apologies, I had a busy day yesterday. Though it might not be that easy to repro, because I'm not sure which GitHub event is responsible for it.
Kishore Nallan
08:34 AM
👍 no worries
