Join by nearest date for a table with duplicate records in BigQuery
Problem description
I have an installs table that contains several installs with the same user_id but different install_date values.
I want every revenue record joined to the nearest install record for the same user whose install_date is earlier than the revenue_date, because I need that install's source field for further processing.
This means the number of output rows should equal the number of records in the revenue table.
How can this be achieved in BigQuery?
Here is the data:
installs
install_date user_id source
--------------------------------
2020-01-10 user_a source_I
2020-01-15 user_a source_II
2020-01-20 user_a source_III
***info about another users***
revenue
revenue_date user_id revenue
--------------------------------------------
2020-01-11 user_a 10
2020-01-21 user_a 20
***info about another users***
Recommended answer
Consider the following solution:
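select
  -- all revenue columns; every row in a group is the same revenue row
  any_value(r).*,
  -- the latest install before the revenue date, minus the duplicate user_id column
  array_agg(
    (select as struct i.* except(user_id))
    order by install_date desc
    limit 1
  )[offset(0)].*
from `project.dataset.revenue` r
join `project.dataset.installs` i
  on i.user_id = r.user_id
  and install_date < revenue_date
-- one group per revenue row, keyed by the row's string representation
group by format('%t', r)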
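If applied to the sample data in your question, the output is:
revenue_date user_id revenue install_date source
-------------------------------------------------------------
2020-01-11 user_a 10 2020-01-10 source_I
2020-01-21 user_a 20 2020-01-20 source_III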
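One caveat: the inner join in the query above drops any revenue row that has no earlier install, so the output can end up shorter than the revenue table. The sketch below is one possible way to keep such rows, assuming a left join combined with row_number() is acceptable in place of array_agg. It is not part of the original answer. The sample data from the question is inlined as CTEs so the query can be run as-is, and the CTE names and the rn column are only illustrative.

```sql
-- Sample data from the question, inlined so no real tables are needed.
with installs as (
  select date '2020-01-10' as install_date, 'user_a' as user_id, 'source_I' as source union all
  select date '2020-01-15', 'user_a', 'source_II' union all
  select date '2020-01-20', 'user_a', 'source_III'
),
revenue as (
  select date '2020-01-11' as revenue_date, 'user_a' as user_id, 10 as revenue union all
  select date '2020-01-21', 'user_a', 20
),
ranked as (
  select
    r.*,
    i.install_date,
    i.source,
    -- rank the earlier installs of each revenue row, newest first
    row_number() over (
      partition by r.user_id, r.revenue_date, r.revenue
      order by i.install_date desc
    ) as rn
  from revenue r
  -- left join keeps revenue rows that have no earlier install
  left join installs i
    on i.user_id = r.user_id
    and i.install_date < r.revenue_date
)
-- keep only the nearest earlier install for each revenue row
select * except(rn)
from ranked
where rn = 1
order by revenue_date
```

Revenue rows without an earlier install come back with null install_date and source. Like the group by format('%t', r) version, this treats fully identical revenue rows as a single row, so a unique row id would be needed if such duplicates can occur.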