BigQuery: create a repeated record field from a query

Question

Is it possible to create a repeated record column in BigQuery? For example, for the following data:

| a | b | c |
-------------
| 1 | 5 | 2 |
-------------
| 1 | 3 | 1 |
-------------
| 2 | 2 | 1 |

Is the following possible?

Select a, NEST(b, c) as d from *table* group by a

To produce the following results

| a | d.b | d.c |
-----------------
| 1 |  5  |  2  |
-----------------
|   |  3  |  1  |
-----------------
| 2 |  2  |  1  |

Solution

One way to work around the NEST() limitation of "nesting" just one field is to use BigQuery User-Defined Functions. They are extremely powerful but have some Limits and Limitations to be aware of. The most important one to keep in mind, from my perspective, is that queries using them are likely candidates for being classified as expensive High-Compute queries:

Complex queries can consume extraordinarily large computing resources relative to the number of bytes processed. Typically, such queries contain a very large number of JOIN or CROSS JOIN clauses or complex User-defined Functions.

So, below is an example that mimics NEST(b, c) from the example in the question:

SELECT a, d.b, d.c FROM 
JS((      // input table
  SELECT a, NEST(CONCAT(STRING(b), ',', STRING(c))) AS d
  FROM (
    SELECT * FROM 
    (SELECT 1 AS a, 5 AS b, 2 AS c),
    (SELECT 1 AS a, 3 AS b, 1 AS c),
    (SELECT 2 AS a, 2 AS b, 1 AS c)
  ) GROUP BY a),
  a, d,     // input columns
  "[{'name': 'a', 'type': 'INTEGER'},    // output schema
    {'name': 'd', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'b', 'type': 'STRING'},
       {'name': 'c', 'type': 'STRING'}
     ]    
    }
  ]",
  "function(row, emit){    // function 
    var c = [];
    for (var i = 0; i < row.d.length; i++) {
      var x = row.d[i].toString().split(',');
      var t = {b:x[0], c:x[1]};
      c.push(t);
    }
    emit({a: row.a, d: c});  
  }"
)

It is relatively straightforward. I hope you will be able to walk through it and get the idea.
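
To see what the inline function does to a single grouped row, the same logic can be run outside BigQuery as plain JavaScript. This is only a sketch: the names `unpack`, `collect` and `emitted`, and the sample rows, are assumptions made for illustration. Inside BigQuery, `row` and `emit` are supplied by the JS() call itself.

```javascript
// Same logic as the inline UDF body, with local variables declared.
// `unpack` is a hypothetical name; BigQuery passes `row` and `emit` itself.
function unpack(row, emit) {
  var records = [];
  for (var i = 0; i < row.d.length; i++) {
    // Each element of row.d is a "b,c" string built by CONCAT in the input query.
    var parts = row.d[i].toString().split(',');
    records.push({ b: parts[0], c: parts[1] });
  }
  emit({ a: row.a, d: records });
}

// Stand-in for BigQuery's emit: collect emitted rows in an array.
var emitted = [];
function collect(outputRow) {
  emitted.push(outputRow);
}

// Sample grouped rows, shaped like the output of the inner NEST(CONCAT(...)) query.
unpack({ a: 1, d: ['5,2', '3,1'] }, collect);
unpack({ a: 2, d: ['2,1'] }, collect);

console.log(JSON.stringify(emitted, null, 2));
```

Each input row yields exactly one output row whose `d` field holds a list of `{b, c}` objects. Because the values are split out of a string, `b` and `c` come back as strings, which is why the output schema declares both as STRING.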

Still, remember:

No matter how you create a record with nested/repeated fields, BigQuery automatically flattens query results, so visible results won't contain repeated fields. So you should use it as a subselect that produces intermediate results for immediate use by the same query.

As an FYI, you can verify for yourself that the query above returns only two records (not three, as it appears when flattened) by running the query below:

SELECT COUNT(1) AS rows FROM (
  <above query here>
) 
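
The gap between two nested records and three displayed rows can also be illustrated in plain JavaScript. This is a rough sketch of the idea, not BigQuery's implementation: `flattenRows` is a hypothetical helper, and the sample data is written out by hand to match the shape of the query result above.

```javascript
// Two nested records, shaped like the output of the query above.
var nested = [
  { a: 1, d: [{ b: '5', c: '2' }, { b: '3', c: '1' }] },
  { a: 2, d: [{ b: '2', c: '1' }] }
];

// Hypothetical helper: one output row per element of the repeated field `d`,
// with the parent field `a` copied onto each row.
function flattenRows(records) {
  var rows = [];
  records.forEach(function (record) {
    record.d.forEach(function (item) {
      rows.push({ a: record.a, 'd.b': item.b, 'd.c': item.c });
    });
  });
  return rows;
}

var flat = flattenRows(nested);
console.log('nested records:', nested.length); // 2
console.log('flattened rows:', flat.length);   // 3
```

The COUNT(1) query counts the nested records, so it returns 2, while the flattened display shows one row per element of `d`.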

Another important note:
It is a known issue that NEST() is not compatible with unflattened results output, so it is mostly used for intermediate results in a subquery.
In contrast, the solution above can easily be saved directly to a table (with Flatten Results unchecked).
