Is it possible to do joins on collections articles...
# community-help
m
Is it possible to do joins on the collections articles and pageviews with articles.id = pageviews.articleId, without explicitly setting references in the schema? We have collections that get updates from two streams, and the problem is that we aren’t sure of the order in which we receive updates: a pageview event may arrive before the article, which gives us 404s. Are we approaching this correctly? We could put articles and pageviews into the same collection using `emplace`, but we have a follow-up use case where pageviews can come from a different and dynamic number of sources, and that won’t work very nicely with the everything-in-one-collection approach.
f
Why not use references?
m
Because we may receive the pageview data before we receive the article data and then we get a 404 when inserting
f
Try setting it as optional and updating the doc
m
Would we then have to reinsert the pageviews when the article comes in?
f
Yes
m
That doesn’t work then, since that doesn’t let us process the incoming streams independently.
f
Could you explain what the workflow looks like? You're getting pageview data before getting the article data. But the pageview data references an article? Shouldn't an article reference a pageview in that case?
m
Generally yes, that’s what you would expect, but the data is coming from different systems and CMSes which are not under our control, and we have cases where the pageview data was made available to us months ago (we store it in a Kafka topic), while the articles are only slowly being backfilled on a case-by-case basis.
f
Is the article being referenced by a pageview or a pageview by an article?
m
pageviews has an articleId field which references article.id
f
So an article can be on many pageviews but a pageview can have a single article? And a pageview can't exist without an article, but it's being filled before an article?
m
An article only has one pageview object (the latest), but we will receive many pageview update events over the lifetime of an article. A pageview cannot logically exist without an article, but we may not be told about the article before we know about the pageview.
The pageview event is an aggregate of views over the article’s lifetime. It is being computed by another team. Right now we get them daily, but may get them with other frequencies in the future. Still other teams are responsible for providing their articles. When we get a pageview update we may have the article already, we might receive it in the future, or we might never receive it.
f
For your issue with pageviews arriving before articles, I'd suggest:
1. Make your pageviews schema flexible enough to accept entries even when the referenced article doesn't exist yet
2. Use a background job to periodically "reconcile" these orphaned pageviews with articles as they arrive
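A minimal sketch of the reconcile step suggested above, kept independent of any Typesense API. It assumes pageviews are buffered as plain dicts with an `articleId` key and that the IDs of the articles received so far are available as a set; the function name `reconcile` and the sample data are hypothetical, and writing the ready pageviews back to the search index is left out.

```python
from typing import Iterable


def reconcile(orphans: Iterable[dict], known_article_ids: set) -> tuple:
    """Split buffered pageviews into (ready, waiting).

    ready   -> the referenced article has arrived, so the pageview can be linked
    waiting -> the article is still missing; keep it for the next run
    """
    ready, waiting = [], []
    for pageview in orphans:
        if pageview["articleId"] in known_article_ids:
            ready.append(pageview)
        else:
            waiting.append(pageview)
    return ready, waiting


# Hypothetical sample data: two pageviews buffered, only one article known so far.
buffered = [
    {"articleId": "a1", "views": 120},
    {"articleId": "a2", "views": 7},
]
ready, waiting = reconcile(buffered, known_article_ids={"a1"})
```

A periodic job would call this on each run and carry `waiting` over to the next one. Pageviews whose article never arrives simply stay in `waiting`, which matches the "we might never receive it" case described earlier in the thread.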
k
I just realized that we haven't properly documented the `async_reference` property that relaxes the ordering constraint for indexing on collections which refer to each other. Please see this: https://github.com/typesense/typesense/issues/1675#issuecomment-2337604801
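For illustration, a sketch of what the two collection schemas from this thread might look like with the property set. The collection names, `articleId`, and `articles.id` come from the messages above; the `title` and `views` fields and their types are assumptions made up for this example, and the linked issue comment remains the authoritative description of the property.

```python
import json

# Schema for the referenced collection. The `title` field is a hypothetical
# placeholder; the thread only mentions the implicit `id`.
articles_schema = {
    "name": "articles",
    "fields": [
        {"name": "title", "type": "string"},
    ],
}

# Schema for the referencing collection. `async_reference` is set so that a
# pageview can be indexed before the article it points to exists.
pageviews_schema = {
    "name": "pageviews",
    "fields": [
        {
            "name": "articleId",
            "type": "string",
            "reference": "articles.id",
            "async_reference": True,
        },
        {"name": "views", "type": "int64"},  # hypothetical field
    ],
}

# Serialize to the JSON body that would be sent when creating the collection.
pageviews_payload = json.dumps(pageviews_schema, indent=2)
```

With the official Python client, each dict would typically be passed to `client.collections.create(...)`; the client setup is omitted here because connection details are not part of the thread.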
m
That’s awesome and exactly what we need!
f
Added a section in the docs to document this: https://github.com/typesense/typesense-website/pull/298
m
Thanks!
Migrated to `async_reference` and it solves the use case perfectly.