Generalized Pivot Table with Deep Sort in Google BigQuery


Question


      This is a follow-up question to Multi-level pivot in Google BigQuery, in which I wanted to know if it was possible to construct a nested pivot table in Google BigQuery using a single query. It is, and so in this follow-up question, I'd like to explore the general case.

      Here is an example of the data that I'm using (which is also included in this shared Google Sheet):

      Now, I would like to build a pivot table that has the following properties:

      • Nested levels at both the row and col level (the previous question only had nested-cols)
      • Sub-totals within both the rows and cols (the previous only had a grand total)
      • Multiple metrics (the previous only had a single metric)
      • Multiple sorts -- by both deep metrics and by alphabetical (the previous did not have any sort conditions)
      • Limits (the previous did not have any limits at all)

      Here is the pivot built in Google Sheets --

      The conceptual SQL statement here would be:

      SELECT
          SUM(price),
          COUNT(price)
      BROKEN DOWN BY
          Studio (row),
          Title (row),
          Territory ID (col),
          Type (col)
      SORTED/LIMITED BY
          Studio ==> A-Z, LIMIT 3,
          Title ==> SUM(price) in GRAND TOTAL DESC, LIMIT 4,
          Territory ID ==> COUNT(price) in Paramount TOTAL, LIMIT 2,
          Type ==> A-Z, NO LIMIT
      

      I'm not sure how to show the subtotals conceptually, but we should be able to specify those for each of the broken-down-by fields.

      Is it possible to do the above in a single SQL statement in Google BigQuery? What would be the steps to generate it?

      Solution

      Q. What if we do an aggregation and have 10M results? Unless we are applying the limits, etc. in BigQuery, the amount of data transferred would take a tremendous amount …


      Let's clarify the challenge here:

      Usually, you would run something like the query below on the back-end and pull the result up to a visualization tool (front-end) for further manipulation such as sorting, limiting, pivoting, etc.

      #standardSQL
      SELECT
        Studio, 
        Title, 
        TerritoryID,
        Type, 
        SUM(Price) AS Price, 
        COUNT(1) AS Volume
      FROM YourTable  
      GROUP BY Studio, Title, TerritoryID, Type   
      

      As you mentioned, such a result can easily contain 10M+ rows in your case, and you want to reduce its size without affecting your ability to present the final data in your pivot/visualization on the front-end.
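To get a feel for the sizes involved: the number of rows the unrestricted GROUP BY can return is bounded above by the product of the distinct counts of the four dimensions. The sketch below is only a rough illustration in Python. The cardinalities are made-up assumptions, not figures from the question's data, and the "limited" bound assumes 3 studios with 4 titles each and 2 territories, plus one 'Other' bucket per limited dimension.

```python
from math import prod

# Hypothetical distinct-value counts per dimension (assumed for illustration).
cardinalities = {"Studio": 50, "Title": 20_000, "TerritoryID": 10, "Type": 2}

# Upper bound on GROUP BY output rows: one row per possible combination.
max_rows = prod(cardinalities.values())

# With limits applied on the back-end:
#   (Studio, Title) pairs: 4 titles for each of 3 studios, plus one ('Other', 'Other') pair
#   TerritoryID: 2 kept values plus 'Other'
#   Type: not limited
limited_rows = (3 * 4 + 1) * (2 + 1) * cardinalities["Type"]

print(max_rows, limited_rows)
```

The exact numbers do not matter; the point is that the limited bound no longer depends on how many titles or territories exist in the table.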


      A. Recommendation / Solution

      The following shows how to achieve this by applying sorts and limits on the back-end (so the result size is drastically reduced) without losing the ability to pivot and still show totals, etc.

      Let's get to the final query by starting with a simplified one.

      • Initial query (skeleton)

      Let's assume that, based on known criteria, we know in advance which Studios, Titles, Territories and Types should be selected.
      In this case, the query below will return the desired data:

      #standardSQL
      WITH Studios AS (
        -- Hard-coded studios (note: Totals below does not reference this CTE)
        SELECT 'Fox' AS Studio
        UNION ALL SELECT 'Paramount'
      ),
      Titles AS (
        -- Hard-coded (Studio, Title) pairs to keep
        SELECT 'Fox' AS Studio, 'Best Laid Plans' AS Title
        UNION ALL SELECT 'Fox', 'Homecoming'
        UNION ALL SELECT 'Paramount', 'Titanic'
        UNION ALL SELECT 'Paramount', 'Homecoming'
      ),
      Territories AS (
        -- Hard-coded territories to keep
        SELECT 'US' AS TerritoryID
        UNION ALL SELECT 'GB'
      ),
      Totals AS (
        -- Rows not matched by Titles / Territories are rolled up into 'Other'
        SELECT
          IFNULL(b.Studio, 'Other') AS Studio,
          IFNULL(b.Title, 'Other') AS Title,
          IFNULL(c.TerritoryID, 'Other') AS TerritoryID,
          Type,
          ROUND(SUM(Price), 2) AS Price, COUNT(1) AS Volume
        FROM yourTable AS a
        LEFT JOIN Titles AS b ON a.Studio = b.Studio AND a.Title = b.Title
        LEFT JOIN Territories AS c ON a.TerritoryID = c.TerritoryID
        GROUP BY Studio, Title, TerritoryID, Type
      )
      SELECT * FROM Totals
      ORDER BY Studio, Title, TerritoryID, Type
      

      The output will look something like this:

      Studio      Title           TerritoryID Type        Price    Volume  
      Fox         Best Laid Plans GB          Movie         87.32    18    
      Fox         Best Laid Plans GB          TV Episode    50.17    23    
      Fox         Best Laid Plans Other       TV Episode  1131.0      2    
      Fox         Best Laid Plans US          Movie        120.82    18    
      Fox         Best Laid Plans US          TV Episode    53.76    24    
      Fox         Homecoming      GB          TV Episode    60.22    28    
      Fox         Homecoming      Other       TV Episode  2262.0      4    
      Fox         Homecoming      US          TV Episode   128.45    58    
      Other       Other           GB          Movie        142.71    29    
      Other       Other           GB          TV Episode    84.8     40    
      Other       Other           Other       Movie       3292.0      4    
      Other       Other           Other       TV Episode  3282.0     16    
      Other       Other           US          Movie         52.92     8    
      Other       Other           US          TV Episode   233.05   101    
      Paramount   Homecoming      GB          Movie         18.96     4    
      Paramount   Homecoming      US          Movie        124.84    16    
      Paramount   Titanic         GB          Movie         41.92     8    
      Paramount   Titanic         Other       Movie         12.0      4    
      Paramount   Titanic         US          Movie        139.84    16   
      

      You can easily feed this back to your UI and visualize it in whatever way you need.
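As a rough idea of what that front-end step could look like, here is a minimal Python sketch that turns the long-format rows into a nested pivot with subtotals and a grand total. It is an assumption about how a UI layer might do this, not part of the original answer. The rows are the Paramount rows copied from the output above, and the `pivot` helper and its key convention (`None` marks a subtotal level) are hypothetical.

```python
from collections import defaultdict

# Paramount rows copied from the skeleton query's output:
# (Studio, Title, TerritoryID, Type, Price, Volume)
rows = [
    ("Paramount", "Homecoming", "GB", "Movie", 18.96, 4),
    ("Paramount", "Homecoming", "US", "Movie", 124.84, 16),
    ("Paramount", "Titanic", "GB", "Movie", 41.92, 8),
    ("Paramount", "Titanic", "Other", "Movie", 12.0, 4),
    ("Paramount", "Titanic", "US", "Movie", 139.84, 16),
]

def pivot(rows):
    """Build {row_key: {col_key: [price, volume]}} including subtotals.

    A None component marks a subtotal over that level, so (studio, None)
    is a studio subtotal row and (None, None) is the grand total.
    """
    cells = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for studio, title, territory, type_, price, volume in rows:
        row_keys = [(studio, title), (studio, None), (None, None)]
        col_keys = [(territory, type_), (territory, None), (None, None)]
        # Add the row's metrics to its own cell and to every enclosing subtotal.
        for rk in row_keys:
            for ck in col_keys:
                cell = cells[rk][ck]
                cell[0] += price
                cell[1] += volume
    return cells

table = pivot(rows)
studio_total = table[("Paramount", None)][(None, None)]
titanic_us = table[("Paramount", "Titanic")][("US", None)]
print(round(studio_total[0], 2), studio_total[1])
print(round(titanic_us[0], 2), titanic_us[1])
```

Because the subtotals are derived from the already reduced result, none of them has to be computed or transferred by BigQuery.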

      • "Final" query

      Now, instead of hard-coding values for all involved dimensions, let's implement the actual criteria for each dimension.
      The only changes in the query below (compared with the skeleton query above) are in the following CTEs: Studios, Titles, and Territories.

      #standardSQL
      WITH Studios AS (
        -- First 3 studios alphabetically (note: Totals below does not reference this CTE)
        SELECT DISTINCT Studio
        FROM yourTable
        ORDER BY Studio LIMIT 3
      ),
      Titles AS (
        -- Top 4 titles per studio, ranked by total price
        SELECT Studio, Title
        FROM (
          SELECT Studio, Title, ROW_NUMBER() OVER(PARTITION BY Studio ORDER BY Price DESC) AS pos
          FROM (SELECT Studio, Title, SUM(Price) AS Price FROM yourTable GROUP BY Studio, Title)
        ) WHERE pos <= 4
      ),
      Territories AS (
        -- Top 2 territories by number of Paramount rows
        SELECT TerritoryID FROM yourTable
        WHERE Studio = 'Paramount' GROUP BY TerritoryID
        ORDER BY COUNT(1) DESC LIMIT 2
      ),
      Totals AS (
        -- Rows not matched by Titles / Territories are rolled up into 'Other'
        SELECT
          IFNULL(b.Studio, 'Other') AS Studio,
          IFNULL(b.Title, 'Other') AS Title,
          IFNULL(c.TerritoryID, 'Other') AS TerritoryID,
          Type,
          ROUND(SUM(Price), 2) AS Price, COUNT(1) AS Volume
        FROM yourTable AS a
        LEFT JOIN Titles AS b ON a.Studio = b.Studio AND a.Title = b.Title
        LEFT JOIN Territories AS c ON a.TerritoryID = c.TerritoryID
        GROUP BY Studio, Title, TerritoryID, Type
      )
      SELECT * FROM Totals
      -- Drop the 'Other' territory bucket from the final output
      WHERE TerritoryID != 'Other'
      ORDER BY Studio, TerritoryID DESC, Type, Price DESC, Title
      

      The result here is:

      Studio      Title                   TerritoryID  Type         Price  Volume
      Fox         Best Laid Plans         US           Movie       120.82      18
      Fox         Titanic                 US           Movie        52.92       8
      Fox         1:00 P.M. - 2:00 P.M.   US           TV Episode  187.25      81
      Fox         Homecoming              US           TV Episode  128.45      58
      Fox         Best Laid Plans         US           TV Episode   53.76      24
      Fox         Best Laid Plans         GB           Movie        87.32      18
      Fox         Titanic                 GB           Movie        78.84      16
      Fox         1:00 P.M. - 2:00 P.M.   GB           TV Episode   61.42      28
      Fox         Homecoming              GB           TV Episode   60.22      28
      Fox         Best Laid Plans         GB           TV Episode   50.17      23
      Paramount   Titanic                 US           Movie       139.84      16
      Paramount   Homecoming              US           Movie       124.84      16
      Paramount   Titanic                 GB           Movie        41.92       8
      Paramount   Homecoming              GB           Movie        18.96       4
      Sony        Best Laid Plans         US           TV Episode    22.9      10
      Sony        Homecoming              US           TV Episode    22.9      10
      Sony        Best Laid Plans         GB           Movie        63.87      13
      Sony        Homecoming              GB           TV Episode   18.81       9
      Sony        Best Laid Plans         GB           TV Episode    4.57       3
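The two building blocks of the final query, ranking titles per studio with ROW_NUMBER() and folding unmatched rows into 'Other' through LEFT JOIN plus IFNULL, can be tried locally. The sketch below is an approximation under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for BigQuery (SQLite 3.25 or newer is needed for window functions), toy data invented for the example, a limit of 2 instead of 4, and positional GROUP BY to sidestep SQLite's stricter column-name resolution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (Studio TEXT, Title TEXT, Price REAL)")
# Toy rows, invented for illustration only.
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?)",
    [
        ("Fox", "A", 10.0), ("Fox", "A", 5.0),
        ("Fox", "B", 12.0),
        ("Fox", "C", 1.0),
        ("Sony", "A", 2.0),
        ("Sony", "D", 3.0),
    ],
)

query = """
WITH Titles AS (
  -- Keep the top 2 titles per studio by total price
  SELECT Studio, Title
  FROM (
    SELECT Studio, Title,
           ROW_NUMBER() OVER (PARTITION BY Studio ORDER BY Price DESC) AS pos
    FROM (SELECT Studio, Title, SUM(Price) AS Price
          FROM yourTable GROUP BY Studio, Title)
  )
  WHERE pos <= 2
)
-- Rows whose (Studio, Title) was not kept fall into the 'Other' bucket
SELECT IFNULL(b.Studio, 'Other') AS Studio,
       IFNULL(b.Title, 'Other') AS Title,
       SUM(a.Price) AS Price,
       COUNT(1) AS Volume
FROM yourTable AS a
LEFT JOIN Titles AS b ON a.Studio = b.Studio AND a.Title = b.Title
GROUP BY 1, 2
ORDER BY 1, 2
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Fox's third title, "C", does not make the cut, so its row shows up under ('Other', 'Other') instead of disappearing, which is what keeps the totals intact.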
      

      The point here is that while BigQuery is extremely efficient at analyzing billions of rows and extracting the needed information, it is quite inefficient to use BigQuery to tailor the result data to how it will be presented in the presentation layer of the client UI. Instead, you should just pass this data to the UI and have your visualization code handle it.
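For instance, the display order of pivot rows and columns can be decided in the UI layer from the reduced result. The following Python sketch is one possible way to do that, not the answer's own code. The rows are the Paramount rows from the final result above, and `order_by_total` is a hypothetical helper.

```python
from collections import defaultdict

# Paramount rows from the final result:
# (Studio, Title, TerritoryID, Type, Price, Volume)
rows = [
    ("Paramount", "Titanic", "US", "Movie", 139.84, 16),
    ("Paramount", "Homecoming", "US", "Movie", 124.84, 16),
    ("Paramount", "Titanic", "GB", "Movie", 41.92, 8),
    ("Paramount", "Homecoming", "GB", "Movie", 18.96, 4),
]

def order_by_total(rows, key_index, metric_index, limit=None):
    """Return the distinct values of one dimension, largest metric total first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[metric_index]
    # Ties fall back to alphabetical order so the result is deterministic.
    ordered = sorted(totals, key=lambda k: (-totals[k], k))
    return ordered if limit is None else ordered[:limit]

# Columns: territories by total Volume; rows: titles by total Price.
territory_order = order_by_total(rows, key_index=2, metric_index=5, limit=2)
title_order = order_by_total(rows, key_index=1, metric_index=4)
print(territory_order, title_order)
```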
