使用与重复字段名称相同的字段时,BigQuery变平 [英] BigQuery flattens when using field with same name as repeated field

查看:98
本文介绍了使用与重复字段名称相同的字段时,BigQuery变平的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

编辑使用公共数据集

我有一个表格,可以访问以下模式此处:



如果我运行在查询之后,我得到了 cnt1 cnt2 不同的结果。

  SELECT 
COUNT(*)AS cnt1,
COUNT(dr_id)as cnt2,
FROM(SELECT * FROM rs_public.test_count)AS tc
WHERE
tc .is_published



如果我从where子句中删除了 tc 别名,那么对于这两种计数我都会得到相同的结果:

  SELECT 
COUNT(*)AS cnt1,
COUNT(dr_id)as cnt2,
FROM(SELECT * FROM rs_public.test_count)AS tc
WHERE
is_published



然而,如果我重复第一个查询,但是使用where子句中的 is_claimed 字段,我会再次获得相同的计数。

  SELECT 
COUNT(*)AS cnt1,
COUNT(dr_id)as cnt2,
FROM(SELECT * FROM rs_public.test_count) AS tc
WHERE
tc.is_claimed



我认为这是一个错误,BigQuery越来越困惑,因为 is_published 是外部字段,也是 cover_photos 记录的叶字段。在评估结果是否应该平展时,错误地使用 cover_photos.is_published 字段,而在根据where子句过滤结果时使用外部 is_published 字段。 p>




下面是不使用select *的反例,我在下面对Felipe的回答的评论中引用: (b)

$









$ dr $ id
cover_photos.is_published
FROM
[realself-main:rs_public.test_count])

返回3.

  SELECT 
COUNT(*),COUNT(0)
FROM
SELECT
dr_id,
cover_photos.is_published
FROM
[realself-main:rs_public.test_count])

会返回7和3!根据我的评论,它似乎是唯一安全的选择是从来没有使用计数(*)

解决方案

我添加一个新的回答,因为你不断给这个问题添加元素 - 他们都应该得到不同的答案。



你说这个查询令你惊讶:

  SELECT COUNT(*),COUNT(0)
FROM(
SELECT dr_id,cover_photos.is_published
FROM [realself-main:你很惊讶,因为结果是7和3。$ / $ p





也许这是有道理的,如果我试试这个:

  SELECT COUNT(*), COUNT(0),
GROUP_CONCAT(STRING(cover_photos.is_published)),
GROUP_CONCAT(STRING(dr_id)),
GROUP_CONCAT(IFNULL(STRING(cover_photos.is_published),'null') ),
GROUP_CONCAT(0)
FROM(
SELECT dr_id,cover_photos.is_published
FROM [realself-main:rs_public.test_count]

请参阅?这是相同的查询,加上4个不同的相同子列的聚合,其中一个由嵌套的重复数据组成,并且在一行中也有一个空值。



查询的结果是:

  7 3 1,1,1,0,0,0 1234,4321,9999 null,1,1,1,0,0,0,0,0,0 

7来从嵌套数据的完整扩展到7行,作为第5列提示。这3个来自三次评估0,可以在第6列看到。

这些微妙之处都与使用嵌套重复数据有关。我建议您不要使用嵌套的重复数据,直到您准备好接受使用嵌套重复数据时可能发生的这些微妙情况。


Edited to use public dataset

I have a table with the following schema, which you can access here: https://bigquery.cloud.google.com/table/realself-main:rs_public.test_count

If I run the following query, I get a different result for cnt1 vs. cnt2.

SELECT 
  COUNT(*) AS cnt1,
  COUNT(dr_id) as cnt2,
FROM (SELECT * FROM rs_public.test_count) AS tc
WHERE
  tc.is_published

If I remove the tc alias from the where clause, I get the same result for both counts:

SELECT 
  COUNT(*) AS cnt1,
  COUNT(dr_id) as cnt2,
FROM (SELECT * FROM rs_public.test_count) AS tc
WHERE
  is_published

If, however, I repeat the first query but use the is_claimed field in the where clause instead, I get the same count again.

SELECT 
  COUNT(*) AS cnt1,
  COUNT(dr_id) as cnt2,
FROM (SELECT * FROM rs_public.test_count) AS tc
WHERE
  tc.is_claimed

I think this is a bug and BigQuery is getting confused because is_published is an outer field and also a leaf field of the cover_photos record. It is incorrectly using the cover_photos.is_published field when evaluating whether the results should be flattened -- but using the outer is_published field when filtering the results per the where clause.


Here's the counter-example that doesn't use select *, which I reference in my comment on Felipe's answer below:

SELECT
  COUNT(*)
FROM (
  SELECT
    dr_id,
    cover_photos.is_published
  FROM
    [realself-main:rs_public.test_count] )

returns 3.

SELECT
  COUNT(*), COUNT(0)
FROM (
  SELECT
    dr_id,
    cover_photos.is_published
  FROM
    [realself-main:rs_public.test_count] )

returns 7 and 3! Per my comment, it seems like the only safe option is never to use count(*)

解决方案

I'm adding a new answer, as you keep adding elements to the question - they all deserve a different answer.

You say this query surprises you:

SELECT COUNT(*), COUNT(0)
FROM (
  SELECT dr_id, cover_photos.is_published
  FROM [realself-main:rs_public.test_count] )

You are surprised because the results are 7 and 3.

Maybe it will make sense if I try this:

SELECT COUNT(*), COUNT(0), 
       GROUP_CONCAT(STRING(cover_photos.is_published)),
       GROUP_CONCAT(STRING(dr_id)), 
       GROUP_CONCAT(IFNULL(STRING(cover_photos.is_published),'null')),
       GROUP_CONCAT("0")
FROM (
  SELECT dr_id, cover_photos.is_published
  FROM [realself-main:rs_public.test_count] 
)

See? It's the same query, plus 4 different aggregations of the same sub-columns, one of which consists of nested repeated data, and that also has a null value in one row.

The results of the query are:

7   3   1,1,1,0,0,0 1234,4321,9999  null,1,1,1,0,0,0    0,0,0

The 7 comes from the full expansion of the nested data into 7 rows, as the 5th column hints. The 3 comes from just evaluating "0" three times, as can be seen on the 6th column.

These subtleties are all related to working with nested repeated data. I'll advise you to not work with nested repeated data until you are ready to accept that these subtleties can happen when working with nested repeated data.

这篇关于使用与重复字段名称相同的字段时,BigQuery变平的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆