How to scale Pivoting in BigQuery?

Problem Description

Let's say I have a music video play stats table, mydataset.stats, for a given day (3B rows, 1M users, 6K artists). The simplified schema is: UserGUID String, ArtistGUID String

I need to pivot/transpose artists from rows to columns, so the schema becomes:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
with each artist's play count for the respective user.

An approach was suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it looks like it doesn't scale for the numbers in my example.
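For reference, the approach from those questions boils down to one conditional aggregate per artist column, roughly like the sketch below (the artist GUID literals are placeholders; with ~6K artists this means thousands of such expressions in a single SELECT):

SELECT UserGUID,
  SUM(IF(ArtistGUID = 'artist_guid_1', 1, NULL)) AS Artist1,
  SUM(IF(ArtistGUID = 'artist_guid_2', 1, NULL)) AS Artist2, . . .
FROM [mydataset.stats]
GROUP EACH BY UserGUID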

Can this approach be scaled for my example?

Solution

I tried the approach below with up to 6000 features and it worked as expected. I believe it will work up to 10K features, which is the hard limit on the number of columns in a table.

STEP 1 - Aggregate plays by user / artist

SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays 
FROM [mydataset.stats] GROUP BY 1, 2

STEP 2 – Normalize uid and aid so they become consecutive numbers 1, 2, 3, … .
We need this for at least two reasons: a) to make the dynamically created SQL later as compact as possible, and b) to have more usable/friendly column names

Combined with the first step, it becomes:

SELECT u.uid AS uid, a.aid AS aid, plays 
FROM (
  SELECT userGUID, artistGUID, COUNT(1) AS plays 
  FROM [mydataset.stats] 
  GROUP BY 1, 2
) AS s
JOIN (
  SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
  SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID 

Let's write the output to a table: mydataset.aggs
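As a quick sanity check (not part of the original recipe), it is handy to inspect mydataset.aggs before STEP 3: because uid and aid come from ROW_NUMBER(), their maxima equal the user and artist counts, which tells you how many 2000-feature chunks STEP 3 will need.

SELECT MAX(uid) AS users, MAX(aid) AS artists, COUNT(1) AS user_artist_pairs
FROM [mydataset.aggs]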

STEP 3 – Use the approach already suggested (in the above-mentioned questions) for N features (artists) at a time. In my particular example, by experimenting, I found that the basic approach works well for between 2000 and 3000 features. To be on the safe side, I decided to use 2000 features at a time

The script below dynamically generates a query that is then run to create the partitioned tables

SELECT 'SELECT uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)

The above query produces yet another query, like the one below:

SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
  SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid 

This should be run and written to mydataset.pivot_1_2000

Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN as sketched below), we get two more tables, mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
As you can see, mydataset.pivot_1_2000 has the expected schema but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on
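For instance, the generator for the second chunk differs from the STEP 3 script only in the HAVING range; its output should then be run and written to mydataset.pivot_2001_4000:

SELECT 'SELECT uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 2000 and aid < 4001)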

STEP 4 – Merge all partitioned pivot tables into a final pivot table, with all features represented as columns in one table

Same as in the steps above: first we generate the query and then run it. So, initially we will "stitch" mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, and then the result with mydataset.pivot_4001_6000

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)

The output string from above should be run and the result written to mydataset.pivot_1_4000
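For illustration, the generated query has roughly the following shape (abbreviated the same way as the STEP 3 example; the a1 … a4000 column list is produced from mydataset.aggs):

SELECT x.uid uid, a1, a2, . . . , a2000, a2001, . . . , a4000
FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid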

Then we repeat STEP 4 as below:

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)

Result to be written to mydataset.pivot_1_6000

The resulting table has the following schema:

uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int 
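As a final check (again, not part of the original recipe): each partitioned pivot table keeps every uid, since the STEP 3 query only restricts which columns are generated, not which rows of mydataset.aggs are read, so the stitched table should contain one row per user (roughly 1M here, equal to MAX(uid) in mydataset.aggs).

SELECT COUNT(1) AS users FROM [mydataset.pivot_1_6000]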

NOTE:
a. I tried this approach only with up to 6000 features, and it worked as expected
b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 minutes
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables are relatively small (30-40 MB), and so are the billed bytes. For "before 2016" projects everything is billed as tier 1, but after October 2016 this can be an issue.
For more information, see Timing in High-Compute queries (https://cloud.google.com/bigquery/pricing#high-compute)
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that storing a materialized feature matrix is not the best idea
