Join by nearest date for the table with duplicate records in BigQuery

Problem description

I have an installs table in which several installs share the same user_id but have different install_date values. I want every revenue record joined with the nearest install record whose install_date is earlier than the revenue_date, because I need that install's source field value for further processing. This means the number of output rows should equal the number of records in the revenue table. How can this be achieved in BigQuery?

Here is the data:

installs
install_date    user_id     source
--------------------------------
2020-01-10      user_a      source_I           
2020-01-15      user_a      source_II
2020-01-20      user_a      source_III
***info about other users***

revenue
revenue_date    user_id     revenue
--------------------------------------------
2020-01-11      user_a      10
2020-01-21      user_a      20
***info about other users***

Recommended answer

Consider the solution below:

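-- Join each revenue row to every earlier install of the same user,
-- then keep only the most recent of those installs.
select any_value(r).*,
    array_agg(
        (select as struct i.* except(user_id))  -- install columns, minus the duplicate user_id
        order by install_date desc              -- latest install first
        limit 1
    )[offset(0)].*
from `project.dataset.revenue` r
join `project.dataset.installs` i
on i.user_id = r.user_id
and install_date < revenue_date                 -- only installs strictly before the revenue date
group by format('%t', r)                        -- one group per revenue row, serialized as text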

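If applied to the sample data in your question, the output for user_a is:

revenue_date    user_id     revenue     install_date    source
----------------------------------------------------------------
2020-01-11      user_a      10          2020-01-10      source_I
2020-01-21      user_a      20          2020-01-20      source_III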
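
One point worth checking against the requirement that the output row count equal the revenue row count: the inner join in the answer drops any revenue row that has no earlier install for the same user. The variant below is a minimal sketch of one way around that, not the answer author's method. It swaps in a left join plus row_number() and inlines the question's sample data as CTEs so that it runs on its own. The CTE names and the choice of partition columns are assumptions made for illustration.

```sql
-- Sample data from the question, inlined so the query is self-contained.
with installs as (
  select date '2020-01-10' as install_date, 'user_a' as user_id, 'source_I' as source union all
  select date '2020-01-15', 'user_a', 'source_II' union all
  select date '2020-01-20', 'user_a', 'source_III'
),
revenue as (
  select date '2020-01-11' as revenue_date, 'user_a' as user_id, 10 as revenue union all
  select date '2020-01-21', 'user_a', 20
),
ranked as (
  select
    r.*,
    i.install_date,
    i.source,
    -- Rank the candidate installs of each revenue row, latest install first.
    -- Assumption: (user_id, revenue_date, revenue) identifies a revenue row;
    -- fully identical revenue rows would collapse into one.
    row_number() over (
      partition by r.user_id, r.revenue_date, r.revenue
      order by i.install_date desc
    ) as rn
  from revenue r
  -- The left join keeps revenue rows that have no earlier install;
  -- their install_date and source come back as null.
  left join installs i
    on i.user_id = r.user_id
    and i.install_date < r.revenue_date
)
select * except(rn)
from ranked
where rn = 1
order by revenue_date
```

On the sample data this should return one row per revenue record: source_I for the 2020-01-11 revenue and source_III for the 2020-01-21 revenue. If the revenue table can contain fully identical rows, a unique row key would be needed in the partition to keep them apart; the group by format('%t', r) in the answer merges such rows in the same way.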
