Joining data from two sources using BigQuery

Question

Can anyone please check whether the code below is correct? In cte_1, I take all dimensions and metrics from t1 except value1, value2 and value3. In cte_2, I assign a unique row number to each row of t2. In cte_3, I join the two on the keys Date and Ad to get all distinct dimensions and metrics. In cte_4, I keep only the rows with row number 1. sum(value1), sum(value2) and sum(value3) come out correct, but sum(value4) is incorrect.

WITH cte_1 AS (
  SELECT * EXCEPT(value1, value2, value3)
  FROM t1
  WHERE Date > "2020-02-16" AND Publisher = "fb"
)
-- Find unique row number from t2
, cte_2 AS (
  SELECT ROW_NUMBER() OVER (ORDER BY Date) AS distinct_row_number, *
  FROM t2
)
, cte_3 AS (
  SELECT cte_2.*, cte_1.* EXCEPT(Date)
  FROM cte_2
  JOIN cte_1
    ON cte_2.Date = cte_1.Date
   AND cte_2.Ad = cte_1.Ad
)
, cte_4 AS (
  SELECT *
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY distinct_row_number ORDER BY Date) AS rn
    FROM cte_3
  ) T
  WHERE rn = 1
)
SELECT SUM(value1), SUM(value2), SUM(value3), SUM(value4) FROM cte_4

Please see the sample table below:

Recommended answer

Your data does not seem to match the query you shared: it lacks a field named Ad, and other fields have different names, such as ReportDate instead of Date. Still, I was able to identify some issues and propose improvements.

First, your temporary table cte_1 only applies a filter in the WHERE clause, so you could inline it as a subquery in the FROM clause of your last step, for example:

SELECT * FROM (SELECT field1, field2, field3 FROM t1 WHERE Date > DATE(2020, 2, 16))
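
As a rough illustration of the point above, the following sketch compares a filter placed in a CTE with the same filter inlined as a subquery in FROM. It runs on SQLite through Python's sqlite3 module rather than on BigQuery, and the table contents and the column field1 are made up for the example, so treat it as a sketch under those assumptions.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for BigQuery,
# and the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (Date TEXT, Publisher TEXT, field1 INTEGER);
    INSERT INTO t1 VALUES
        ('2020-02-15', 'fb', 1),
        ('2020-02-17', 'fb', 2),
        ('2020-02-18', 'other', 3),
        ('2020-02-19', 'fb', 4);
""")

# Variant 1: the filter lives in a named CTE.
with_cte = conn.execute("""
    WITH cte_1 AS (
        SELECT Date, field1 FROM t1
        WHERE Date > '2020-02-16' AND Publisher = 'fb'
    )
    SELECT Date, field1 FROM cte_1 ORDER BY Date
""").fetchall()

# Variant 2: the same filter inlined as a subquery in FROM.
inlined = conn.execute("""
    SELECT Date, field1
    FROM (SELECT Date, field1 FROM t1
          WHERE Date > '2020-02-16' AND Publisher = 'fb')
    ORDER BY Date
""").fetchall()

print(with_cte)
print(inlined)
```

Both variants return the same two rows, so the choice between them is mostly a matter of readability.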

Second, in cte_2, you need to select every column you will need from t2 together with the row number. A table that contained only the row number could not be joined with other tables, because it would carry no other information. So if you need the row number, select it alongside the other columns, and make sure those columns include your key(s) if you intend to perform a join later. The syntax would be as follows:

SELECT field1, field2, ROW_NUMBER() OVER (ORDER BY Date) AS distinct_row_number FROM t2

Third, in cte_3, I assume you want to perform an INNER JOIN. You therefore need to make sure that the join keys are present in both tables, in your case Date and Ad, which I could not find in your data. Furthermore, you cannot have duplicated column names when joining two tables and selecting all the columns. For example, in your case Brand, value1, value2 and value3 exist in both tables, which will cause an error. You need to specify which table these fields should come from, either by selecting them one by one or by using an EXCEPT clause.

Finally, cte_4 and your final SELECT could be combined into one step. Basically, you keep only one row of data per partition, ordered by Date, and then sum value1, value2 and value3 individually over those rows. Moreover, you are not selecting any identifier alongside the sums, which means that your result will contain only the overall totals. In general, when performing an aggregation such as SUM(), the key column(s) are selected as well. This step could be written as a single query such as the following, using only the data from t2:

SELECT ReportDate, Brand, SUM(value1) AS sum_1, SUM(value2) AS sum_2, SUM(value3) AS sum_3, SUM(value4) AS sum_4
FROM (SELECT t2.*, ROW_NUMBER() OVER (PARTITION BY ReportDate ORDER BY ReportDate) AS rn FROM t2)
WHERE rn = 1
GROUP BY ReportDate, Brand
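
To see what the row-number filter does before the aggregation, here is a minimal sketch using Python's sqlite3 module as a stand-in for BigQuery. The rows are invented, only value1 is kept for brevity, and the duplicate on 2020-02-17 is an assumption made to show the effect.

```python
import sqlite3

# Assumption: SQLite (3.25+ for window functions) stands in for BigQuery;
# the sample rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t2 (ReportDate TEXT, Brand TEXT, value1 INTEGER);
    INSERT INTO t2 VALUES
        ('2020-02-17', 'A', 10),
        ('2020-02-17', 'A', 10),  -- exact duplicate of the row above
        ('2020-02-18', 'B', 7);
""")

# Summing everything counts the duplicated row twice.
naive_total = conn.execute("SELECT SUM(value1) FROM t2").fetchone()[0]

# Keep one row per ReportDate (rn = 1), then aggregate with identifiers.
rows = conn.execute("""
    SELECT ReportDate, Brand, SUM(value1) AS sum_1
    FROM (
        SELECT t2.*,
               ROW_NUMBER() OVER (PARTITION BY ReportDate
                                  ORDER BY ReportDate) AS rn
        FROM t2
    )
    WHERE rn = 1
    GROUP BY ReportDate, Brand
    ORDER BY ReportDate
""").fetchall()

print(naive_total)
print(rows)
```

Note that ordering by the partition key itself does not distinguish rows within a partition. The kept row is well defined here only because the duplicates are identical; if rows within a date differ, a more specific ORDER BY is needed to control which one survives.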

UPDATE: With your explanation in the comment section, I was able to create a more specific query. The fields ReportDate, Brand, Portfolio, Campaign, value1, value2 and value3 come from t2, whilst value4 comes from t1. The sums are computed over the rows whose row number equals 1. For this reason, the tables t1 and t2 are joined before ROW_NUMBER() is applied. Finally, the last SELECT statement does not select rn, and the data is aggregated by ReportDate, Brand, Portfolio and Campaign.

WITH cte_1 AS (
  -- Join first, so that value4 from t1 sits next to the t2 fields
  SELECT t2.ReportDate, t2.Brand, t2.Portfolio, t2.Campaign,
         t2.value1, t2.value2, t2.value3, t1.value4
  FROM t2
  LEFT JOIN t1
    ON t2.ReportDate = t1.ReportDate
   AND t1.placement = t2.Ad
),
cte_2 AS (
  -- cte_1 exposes ReportDate (not Date), so partition on that column
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ReportDate ORDER BY ReportDate) AS rn
  FROM cte_1
)
SELECT ReportDate, Brand, Portfolio, Campaign,
       SUM(value1) AS sum1, SUM(value2) AS sum2, SUM(value3) AS sum3,
       SUM(value4) AS sum4
FROM cte_2
WHERE rn = 1
GROUP BY 1, 2, 3, 4
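
The join-then-number-then-aggregate pattern can be tried out on toy data. The sketch below again uses Python's sqlite3 module in place of BigQuery. The rows are invented, value2 and value3 are omitted for brevity, and the duplicated t1 row is an assumption chosen to show how unfiltered joined rows would inflate the sums.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery; all rows are made-up samples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t2 (ReportDate TEXT, Brand TEXT, Portfolio TEXT,
                     Campaign TEXT, Ad TEXT, value1 INTEGER);
    CREATE TABLE t1 (ReportDate TEXT, placement TEXT, value4 INTEGER);
    INSERT INTO t2 VALUES
        ('2020-02-17', 'A', 'P1', 'C1', 'ad1', 10),
        ('2020-02-18', 'A', 'P1', 'C1', 'ad1', 20);
    INSERT INTO t1 VALUES
        ('2020-02-17', 'ad1', 100),
        ('2020-02-17', 'ad1', 100),  -- duplicate: the join emits two rows
        ('2020-02-18', 'ad1', 200);
""")

JOINED = """
    WITH cte_1 AS (
        SELECT t2.ReportDate, t2.Brand, t2.Portfolio, t2.Campaign,
               t2.value1, t1.value4
        FROM t2
        LEFT JOIN t1
          ON t2.ReportDate = t1.ReportDate AND t1.placement = t2.Ad
    ),
    cte_2 AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY ReportDate
                                     ORDER BY ReportDate) AS rn
        FROM cte_1
    )
    SELECT ReportDate, Brand, Portfolio, Campaign,
           SUM(value1) AS sum1, SUM(value4) AS sum4
    FROM cte_2
    {where}
    GROUP BY ReportDate, Brand, Portfolio, Campaign
    ORDER BY ReportDate
"""

# Without the rn filter, the duplicated join row is summed twice.
unfiltered = conn.execute(JOINED.format(where="")).fetchall()
# With rn = 1, each ReportDate contributes a single joined row.
filtered = conn.execute(JOINED.format(where="WHERE rn = 1")).fetchall()

print(unfiltered)
print(filtered)
```

In this toy data the unfiltered sums for 2020-02-17 are doubled, while the filtered query reports each value once.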
