Hi Team, we are trying to restrict search results ...
# community-help
s
Hi Team, we are trying to restrict search results (programs, meditations) to the content each user has access to. Below, I’ve outlined our context, the approaches we’ve tried, and the challenges we’re encountering. We would greatly appreciate your guidance, recommendations, or experience.

Context
When users search for quests, meditations, or soundscapes, the search should return results only from content the user has access to.

Current Collections and Structures
1. `user_accesseses` Collection
{
  "name": "user_accesseses",
  "fields": [
    { "name": "user_uid", "type": "string"}, // e.g "auth|2343243434"
    { "name": "content_ids", "type": "string[]"}, // e.g ["media/45", "program/67", "media/567", "channel/45"]
    { "name": "inserted_at", "type": "int64"},
    { "name": "updated_at", "type": "int64"}
  ],
  "default_sorting_field": "inserted_at"
}
2. `programs` Collection
{
  "name": "programs",
  "fields": [
    { "name": "title", "type": "string"},
    { "name": "content_id", "type": "string"}, // e.g "media/45"
    { "name": "rating", "type": "float"},
    { "name": "duration", "type": "float" }
    // ... other fields
  ],
  "default_sorting_field": "rating"
}
3. `meditations` Collection
{
  "name": "meditations",
  "fields": [
    { "name": "content_id", "type": "string"},
    { "name": "title", "type": "string" }
    // ... other fields
  ],
  "default_sorting_field": "title"
}
Data Volume
- Users: 1,000,000+
- Content IDs: ~20,000 (combination of programs and meditations)

Environment
- Typesense deployed in production with separate collections for user access and content metadata.
- Utilized Typesense’s API for indexing and searching.
- Two synchronization mechanisms:
  - Real-Time Updates: On every CRUD operation, the specific collection is updated in Typesense.
  - Sync-All Functionality: On demand, clears and re-syncs all records for bulk updates; primarily used during schema changes.

What We Have Tried
1. Using the `string` Type for `content_id` in Content Schemas
Approach: We initially defined the `content_id` field in both the `programs` and `meditations` collections as a single `string`. This was intended to match the entries in the `content_ids` array of the `user_accesses` collection, facilitating filter-based searches.
Schema Example: `programs` Collection
{
  "name": "programs",
  "fields": [
    { "name": "content_id", "type": "string", "reference": "user_accesses.content_ids"}, // reference as a string
    { "name": "rating", "type": "float" },
    { "name": "duration", "type": "float" }
    // ... other fields
  ],
  "default_sorting_field": "rating"
}
Issues We Encountered
a. Type Mismatch: The `content_ids` in `user_accesses` are arrays of strings (`string[]`), while `content_id` in the content collections was a single `string`. This mismatch caused indexing errors. I noticed that indexing a content document required an entry in `user_accesses` with the same ID to exist beforehand; otherwise it failed with: Reference document having `content_ids:= media/155` not found in the collection `user_accesses`
b. Synchronization Order Dependency: Indexing content collections before `user_accesses` led to scenarios where content was not correctly associated with user access rights. Related to (a) above.
c. Incomplete Joins: When performing filter queries based on `content_ids`, the single `string` type in content collections did not align properly with the array type, leading to inaccurate search results.

2. Using the `string[]` Type for `content_id` in Content Schemas
Approach: To address the type mismatch, we modified the `content_id` field in both the `programs` and `meditations` collections to be an array of strings (`string[]`). This was intended to align with the `content_ids` in `user_accesses` and facilitate proper filtering.
Schema Example: `programs` Collection
{
  "name": "programs",
  "fields": [
    { "name": "content_id", "type": "string[]", "reference": "user_accesses.content_ids" }, // reference on string[] field     
    { "name": "rating", "type": "float" },
    { "name": "duration", "type": "float" }
    // ... other fields
  ],
  "default_sorting_field": "rating"
}
Issues We Encountered
a. Handling New Users: When CRUD operations happen (e.g. for newly onboarded users), the updated access rights were not reflected in search results without reindexing, as the existing synchronization mechanisms did not account for dynamic updates to `content_ids`. This requires re-syncing every collection that references the `content_ids` of `user_accesses` at the time of reindexing.

Request for Assistance
Given the above challenges, we would greatly appreciate your guidance on the following:
1. Efficient Implementation of JOIN-like Filtering:
- Best practices for implementing user-based filtering with JOIN operations, given that our CRUD operations update single records separately.
- Recommendations on schema design or query structuring to achieve our filtering goals effectively.
2. Handling Real-Time Updates and Scalability:
- Strategies to reflect user access changes in real time without necessitating full reindexing.
- Suggestions for managing large-scale user and content datasets (1M+ users, ~20k `content_ids`).
k
cc @Harpreet Sangar
h
@Stephen Njau In the `user_accesseses` collection,
{ "name": "content_ids", "type": "string[]"}, // e.g ["media/45", "program/67", "media/567", "channel/45"]
is this field storing all the `content_ids` of the `programs` and `meditations` that a particular user has access to?
s
@Harpreet Sangar Correct
h
Okay. You should create separate fields in the `user_accesseses` collection for each content type's IDs, like `program_content_ids`, `meditation_content_ids`, etc. Also, make these fields reference the respective collections, like `programs.content_id`, `meditations.content_id`, etc.
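
A minimal sketch of what this suggestion could look like, written as plain Python dictionaries so it does not depend on any client library. The field names `program_content_ids` and `meditation_content_ids` come from the reply above; the helper names `access_filter` and `build_search_params`, and the choice of `title` as the `query_by` field, are assumptions for illustration. The collection name is copied from the schema posted in the thread, so adjust it if the real collection is `user_accesses`. The filter uses Typesense's `$Collection(...)` join syntax, applied when searching a content collection.

```python
# Collection name as posted in the thread's schema; adjust if yours differs.
ACCESS_COLLECTION = "user_accesseses"

# Hypothetical revised schema: one reference field per content collection,
# instead of a single mixed `content_ids` array.
user_access_schema = {
    "name": ACCESS_COLLECTION,
    "fields": [
        {"name": "user_uid", "type": "string"},
        {"name": "program_content_ids", "type": "string[]",
         "reference": "programs.content_id"},
        {"name": "meditation_content_ids", "type": "string[]",
         "reference": "meditations.content_id"},
        {"name": "inserted_at", "type": "int64"},
        {"name": "updated_at", "type": "int64"},
    ],
    "default_sorting_field": "inserted_at",
}


def access_filter(user_uid: str, collection: str = ACCESS_COLLECTION) -> str:
    """Build a join filter restricting results to one user's accessible content."""
    return f"${collection}(user_uid:={user_uid})"


def build_search_params(query: str, user_uid: str) -> dict:
    """Search parameters for a content collection such as `programs`."""
    return {
        "q": query,
        "query_by": "title",  # assumed searchable field
        "filter_by": access_filter(user_uid),
    }


params = build_search_params("sleep", "auth|2343243434")
print(params["filter_by"])
```

With the references living on the access side, the content schemas would presumably no longer need a `reference` on `content_id`, and a change to one user's access should only require updating that user's access document rather than re-syncing `programs` or `meditations`. Given the "Reference document ... not found" error described earlier, the referenced content documents would likely need to be indexed before the access documents that point at them.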