Stephen Njau
12/10/2024, 7:32 AM
1. `user_accesseses` Collection
{
"name": "user_accesseses",
"fields": [
{ "name": "user_uid", "type": "string" }, // e.g. "auth|2343243434"
{ "name": "content_ids", "type": "string[]" }, // e.g. ["media/45", "program/67", "media/567", "channel/45"]
{ "name": "inserted_at", "type": "int64"},
{ "name": "updated_at", "type": "int64"}
],
"default_sorting_field": "inserted_at"
}
2. `programs` Collection
{
"name": "programs",
"fields": [
{ "name": "title", "type": "string" },
{ "name": "content_id", "type": "string" }, // e.g. "media/45"
{ "name": "rating", "type": "float"},
{ "name": "duration", "type": "float" }
// ... other fields
],
"default_sorting_field": "rating"
}
3. `meditations` Collection
{
"name": "meditations",
"fields": [
{ "name": "content_id", "type": "string"},
{ "name": "title", "type": "string" }
// ... other fields
],
"default_sorting_field": "title"
}
Data Volume
- Users: 1,000,000+
- Content IDs: ~20,000 (combination of programs and meditations)
Environment
- Typesense deployed in production with separate collections for user access and content metadata.
- We use Typesense's API for indexing and searching.
- Two synchronization mechanisms:
  - Real-Time Updates: On every CRUD operation, the affected collection is updated in Typesense.
  - Sync-All Functionality: Run on demand; clears and re-syncs all records for bulk updates, primarily used during schema changes.
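The two synchronization paths described above can be sketched roughly as follows. This is a minimal illustration, not our production code: `FakeCollection`, `on_crud`, and `sync_all` are hypothetical names, and the in-memory dictionary only stands in for a Typesense collection.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in for one search collection."""

    def __init__(self):
        self.docs = {}  # document id -> document

    def upsert(self, doc):
        self.docs[doc["id"]] = doc

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def clear(self):
        self.docs.clear()


def on_crud(collection, operation, record):
    """Real-time path: mirror a single create/update/delete."""
    if operation == "delete":
        collection.delete(record["id"])
    else:  # "create" and "update" are both treated as upserts
        collection.upsert(record)


def sync_all(collection, source_records):
    """Sync-all path: drop everything, then re-import every record."""
    collection.clear()
    for record in source_records:
        collection.upsert(record)
    return len(collection.docs)
```

The real-time path touches one document at a time, which is why anything that depends on another collection's contents (such as a reference field) can drift until the next sync-all.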
What We Have Tried
1. Using the `string` Type for `content_id` in Content Schemas
Approach:
We initially defined the `content_id` field in both the `programs` and `meditations` collections as a single `string`.
This was intended to match the entries in the `content_ids` array of the `user_accesses` collection, facilitating filter-based searches.
Schema Example:
`programs` Collection
{
"name": "programs",
"fields": [
{ "name": "content_id", "type": "string", "reference": "user_accesses.content_ids"}, // reference as a string
{ "name": "rating", "type": "float" },
{ "name": "duration", "type": "float" }
// ... other fields
],
"default_sorting_field": "rating"
}
Issues We Encountered
a. Type Mismatch: The `content_ids` in `user_accesses` are arrays of strings (`string[]`), while `content_id` in the content collections was a single `string`. This mismatch caused indexing errors. I noticed that indexing a content document required an entry with the same ID to already exist in `user_accesses`:
Reference document having `content_ids:= media/155` not found in the collection user_accesses
b. Synchronization Order Dependency: Indexing content collections before `user_accesses` led to scenarios where content was not correctly associated with user access rights. Related to (a) above.
c. Incomplete Joins: When performing filter queries based on `content_ids`, the single `string` type in content collections did not align properly with the array type, leading to inaccurate search results.
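The ordering problem in (a) and (b) can be reproduced with a toy model. This is only a sketch of the behaviour we observed, not Typesense's actual implementation; `check_reference` and `index_program` are hypothetical helpers, and the error text simply mirrors the message quoted above.

```python
def check_reference(content_id, user_access_docs):
    """Return the first access document whose content_ids contains content_id."""
    for doc in user_access_docs:
        if content_id in doc["content_ids"]:
            return doc
    raise LookupError(
        f"Reference document having `content_ids:= {content_id}` "
        "not found in the collection user_accesses"
    )


def index_program(program, user_access_docs, indexed_programs):
    """Index a program only if its reference can be resolved (toy rule)."""
    check_reference(program["content_id"], user_access_docs)
    indexed_programs.append(program)


user_access_docs = []   # nothing synced yet
indexed_programs = []
program = {"content_id": "media/155", "rating": 4.5, "duration": 600.0}

try:
    index_program(program, user_access_docs, indexed_programs)
except LookupError as err:
    first_error = str(err)  # content indexed before any access document exists

# Once some user has access to media/155, the same call succeeds.
user_access_docs.append({"user_uid": "auth|2343243434", "content_ids": ["media/155"]})
index_program(program, user_access_docs, indexed_programs)
```

In this model a program that no user can access yet can never be indexed, which matches the dependency on sync order described in (b).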
2. Using the `string[]` Type for `content_id` in Content Schemas
Approach:
To address the type mismatch, we modified the `content_id` field in both the `programs` and `meditations` collections to be an array of strings (`string[]`). This was intended to align with the `content_ids` in `user_accesses` and facilitate proper filtering.
Schema Example:
`programs` Collection
{
"name": "programs",
"fields": [
{ "name": "content_id", "type": "string[]", "reference": "user_accesses.content_ids" }, // reference on string[] field
{ "name": "rating", "type": "float" },
{ "name": "duration", "type": "float" }
// ... other fields
],
"default_sorting_field": "rating"
}
Issues We Encountered:
a. Handling New Users: When CRUD operations occur (e.g. for newly onboarded users), the changed access rights are not reflected in search results without reindexing, because the existing synchronization mechanisms do not account for dynamic updates to `content_ids`.
Every collection that references the `content_ids` of `user_accesses` has to be fully re-synced at reindexing time.
Request for Assistance
Given the above challenges, we would greatly appreciate your guidance on the following:
1. Efficient Implementation of JOIN-like Filtering:
- Best practices for implementing user-based filtering with JOIN operations, given that our CRUD operations update single records individually.
- Recommendations on schema design or query structuring to achieve our filtering goals effectively.
2. Handling Real-Time Updates and Scalability:
- Strategies to reflect user access changes in real-time without necessitating full reindexing.
- Suggestions for managing large-scale user and content datasets (1M+ users, ~20k `content_ids`).
Kishore Nallan
12/10/2024, 10:01 AM
Harpreet Sangar
12/13/2024, 6:32 AM
In the `user_accesseses` collection,
{ "name": "content_ids", "type": "string[]" }, // e.g. ["media/45", "program/67", "media/567", "channel/45"]
is this field storing all the `content_ids` of the `programs` and `meditations` that a particular user has access to?
Stephen Njau
12/16/2024, 12:32 PM
Harpreet Sangar
12/16/2024, 12:36 PM
Use separate fields in the `user_accesseses` collection for each kind of content ID, like: `program_content_ids`, `meditation_content_ids`, etc. Also, make these fields reference the respective collections like: `programs.content_id`, `meditations.content_id`, etc.
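A minimal sketch of what this suggestion could look like, written as plain Python dictionaries so it can be checked without a running server. The field names `program_content_ids` and `meditation_content_ids` come from the message above; `ACCESS_COLLECTION`, `user_accesses_schema`, and `programs_search_params` are illustrative names, and the `$collection(...)` join filter and the backtick escaping of the user ID are assumptions based on Typesense's documented JOIN and `filter_by` syntax that should be verified against the server version in use.

```python
# Collection name as declared in the schema at the top of the thread;
# adjust it if the real collection is named "user_accesses".
ACCESS_COLLECTION = "user_accesseses"

user_accesses_schema = {
    "name": ACCESS_COLLECTION,
    "fields": [
        {"name": "user_uid", "type": "string"},
        # One array field per content type, each referencing its own collection.
        {"name": "program_content_ids", "type": "string[]",
         "reference": "programs.content_id"},
        {"name": "meditation_content_ids", "type": "string[]",
         "reference": "meditations.content_id"},
        {"name": "inserted_at", "type": "int64"},
        {"name": "updated_at", "type": "int64"},
    ],
    "default_sorting_field": "inserted_at",
}


def programs_search_params(user_uid, q="*"):
    """Build search parameters for `programs`, restricted to one user's access.

    Assumption: the join filter is written against the referencing
    collection, and the user ID is wrapped in backticks because it
    contains a `|` character.
    """
    return {
        "q": q,
        "filter_by": f"${ACCESS_COLLECTION}(user_uid:=`{user_uid}`)",
    }


params = programs_search_params("auth|2343243434")
```

With the references living on the access documents, the content schemas no longer need a `reference` on `content_id`. The ordering dependency would presumably flip: a program or meditation has to exist before an access document that lists its ID is indexed, while a change to one user's access becomes an update of that single access document.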