How to scale Pivoting in BigQuery?


Question

Let's say I have a music video play stats table, mydataset.stats, for a given day (3B rows, 1M users, 6K artists). The simplified schema is: UserGUID String, ArtistGUID String.

I need to pivot/transpose artists from rows to columns, so the schema will be:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
with each artist's play count for the respective user.

An approach was suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it looks like it doesn't scale for the numbers in my example.

Can this approach be scaled for my example?

Answer

I tried the approach below with up to 6,000 features and it worked as expected. I believe it will work up to 10K features, which is the hard limit on the number of columns in a table.

STEP 1 – Aggregate plays by user / artist

SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays 
FROM [mydataset.stats] GROUP BY 1, 2

STEP 2 – Normalize uid and aid so they are consecutive numbers 1, 2, 3, …
We need this for at least two reasons: a) to make the dynamically created SQL later as compact as possible, and b) to have more usable/friendly column names.

Combined with the first step, it will be:

SELECT u.uid AS uid, a.aid AS aid, plays 
FROM (
  SELECT userGUID, artistGUID, COUNT(1) AS plays 
  FROM [mydataset.stats] 
  GROUP BY 1, 2
) AS s
JOIN (
  SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
  SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID 

Let's write the output to a table: mydataset.aggs.

STEP 3 – Use the approach already suggested (in the above-mentioned questions) for N features (artists) at a time. In my particular example, by experimenting, I found that the basic approach works well for between 2,000 and 3,000 features. To be on the safe side, I decided to use 2,000 features at a time.

The script below is used to dynamically generate a query that is then run to create the partitioned tables:

SELECT 'SELECT uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)

The above query produces yet another query, like below:

SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
  SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid 

This should be run and the result written to mydataset.pivot_1_2000.

Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN), we get two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.
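
For concreteness, the generator for the second partition is a sketch of the same script with only the HAVING range changed; its output should be written to mydataset.pivot_2001_4000:

SELECT 'SELECT uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 2000 and aid < 4001)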

STEP 4 – Merge all the partitioned pivot tables into a final pivot table, with all features represented as columns in one table.

Same as in the steps above: first we need to generate the query and then run it. So, initially we will "stitch" mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, and then join that result with mydataset.pivot_4001_6000.

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)

The output string from above should be run and the result written to mydataset.pivot_1_4000.
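
The generated query will look roughly like this (a sketch; the actual column list runs from a1 through a4000):

SELECT x.uid uid, a1, a2, . . . , a2000, a2001, . . . , a4000 
FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid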

Then we repeat STEP 4, like below:

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)

This is to be written to mydataset.pivot_1_6000.

The resulting table has the following schema:

uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int 
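
At this point the feature matrix can be queried directly; for example, a quick sanity check might look like this (a sketch; the aN column names follow the pattern above):

SELECT uid, a1, a2, a6000 FROM [mydataset.pivot_1_6000] LIMIT 10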

NOTE:
a. I tried this approach only up to 6,000 features, and it worked as expected.
b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 minutes.
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables' sizes are relatively small (30-40MB), and so are the billed bytes. For "before 2016" projects everything is billed as tier 1, but after October 2016 this can be an issue. For more information, see Timing in High-Compute queries.
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that storing a materialized feature matrix is not the best idea.

