使用与重复字段同名的字段时,BigQuery 会变平 [英] BigQuery flattens when using field with same name as repeated field

查看:17
本文介绍了使用与重复字段同名的字段时,BigQuery 会变平的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

编辑为使用公共数据集

我有一个具有以下架构的表,您可以在此处访问:

如果我运行以下查询,cnt1cnt2 会得到不同的结果.

SELECTCOUNT(*) 作为 cnt1,COUNT(dr_id) 作为 cnt2,FROM (SELECT * FROM rs_public.test_count) AS tc在哪里tc.is_published

如果我从 where 子句中删除 tc 别名,我会得到两个计数相同的结果:

SELECTCOUNT(*) 作为 cnt1,COUNT(dr_id) 作为 cnt2,FROM (SELECT * FROM rs_public.test_count) AS tc在哪里is_published

但是,如果我重复第一个查询,但使用 where 子句中的 is_claimed 字段,我会再次得到相同的计数.

SELECTCOUNT(*) 作为 cnt1,COUNT(dr_id) 作为 cnt2,FROM (SELECT * FROM rs_public.test_count) AS tc在哪里tc.is_claimed

我认为这是一个错误,BigQuery 感到困惑,因为 is_published 是一个外部字段,也是 cover_photos 记录的叶字段.在评估是否应展平结果时,错误地使用了 cover_photos.is_published 字段 - 但在根据 where 子句过滤结果时使用了外部 is_published 字段.><小时>

这是不使用 select * 的反例,我在下面对 Felipe 的回答的评论中引用了它:

SELECT数数(*)从 (选择博士身份,cover_photos.is_published从[realself-main:rs_public.test_count] )

返回 3.

SELECT计数(*),计数(0)从 (选择博士身份,cover_photos.is_published从[realself-main:rs_public.test_count] )

返回 7 和 3!根据我的评论,似乎唯一安全的选择是永远不要使用 count(*)

解决方案

我正在添加一个新答案,因为您不断向问题添加元素 - 它们都应该得到不同的答案.

你说这个查询让你感到惊讶:

SELECT COUNT(*), COUNT(0)从 (选择 dr_id,cover_photos.is_published来自 [realself-main:rs_public.test_count] )

你很惊讶,因为结果是 7 和 3.

如果我试试这个也许会有意义:

SELECT COUNT(*), COUNT(0),GROUP_CONCAT(STRING(cover_photos.is_published)),GROUP_CONCAT(STRING(dr_id)),GROUP_CONCAT(IFNULL(STRING(cover_photos.is_published),'null')),GROUP_CONCAT("0")从 (选择 dr_id,cover_photos.is_published来自 [realself-main:rs_public.test_count])

看到了吗?这是同一个查询,加上4个相同子列的不同聚合,其中一个由嵌套的重复数据组成,并且在一行中还有一个空值.

查询结果为:

7 3 1,1,1,0,0,0 1234,4321,9999 null,1,1,1,0,0,0 0,0,0

7 来自将嵌套数据完全扩展为 7 行,如第 5 列提示.3 来自仅对0"求值三次,如第 6 列所示.

这些微妙之处都与处理嵌套重复数据有关.我建议您不要使用嵌套重复数据,直到您准备好接受使用嵌套重复数据时可能会发生这些微妙之处.

Edited to use public dataset

I have a table with the following schema, which you can access here: https://bigquery.cloud.google.com/table/realself-main:rs_public.test_count

If I run the following query, I get a different result for cnt1 vs. cnt2.

SELECT 
  COUNT(*) AS cnt1,
  COUNT(dr_id) as cnt2,
FROM (SELECT * FROM rs_public.test_count) AS tc
WHERE
  tc.is_published

If I remove the tc alias from the where clause, I get the same result for both counts:

SELECT 
  COUNT(*) AS cnt1,
  COUNT(dr_id) as cnt2,
FROM (SELECT * FROM rs_public.test_count) AS tc
WHERE
  is_published

If, however, I repeat the first query but use the is_claimed field in the where clause instead, I get the same count again.

SELECT 
  COUNT(*) AS cnt1,
  COUNT(dr_id) as cnt2,
FROM (SELECT * FROM rs_public.test_count) AS tc
WHERE
  tc.is_claimed

I think this is a bug and BigQuery is getting confused because is_published is an outer field and also a leaf field of the cover_photos record. It is incorrectly using the cover_photos.is_published field when evaluating whether the results should be flattened -- but using the outer is_published field when filtering the results per the where clause.


Here's the counter-example that doesn't use select *, which I reference in my comment on Felipe's answer below:

SELECT
  COUNT(*)
FROM (
  SELECT
    dr_id,
    cover_photos.is_published
  FROM
    [realself-main:rs_public.test_count] )

returns 3.

SELECT
  COUNT(*), COUNT(0)
FROM (
  SELECT
    dr_id,
    cover_photos.is_published
  FROM
    [realself-main:rs_public.test_count] )

returns 7 and 3! Per my comment, it seems like the only safe option is never to use count(*)

解决方案

I'm adding a new answer, as you keep adding elements to the question - they all deserve a different answer.

You say this query surprises you:

SELECT COUNT(*), COUNT(0)
FROM (
  SELECT dr_id, cover_photos.is_published
  FROM [realself-main:rs_public.test_count] )

You are surprised because the results are 7 and 3.

Maybe it will make sense if I try this:

SELECT COUNT(*), COUNT(0), 
       GROUP_CONCAT(STRING(cover_photos.is_published)),
       GROUP_CONCAT(STRING(dr_id)), 
       GROUP_CONCAT(IFNULL(STRING(cover_photos.is_published),'null')),
       GROUP_CONCAT("0")
FROM (
  SELECT dr_id, cover_photos.is_published
  FROM [realself-main:rs_public.test_count] 
)

See? It's the same query, plus 4 different aggregations of the same sub-columns, one of which consists of nested repeated data, and that also has a null value in one row.

The results of the query are:

7   3   1,1,1,0,0,0 1234,4321,9999  null,1,1,1,0,0,0    0,0,0

The 7 comes from the full expansion of the nested data into 7 rows, as the 5th column hints. The 3 comes from just evaluating "0" three times, as can be seen on the 6th column.

These subtleties are all related to working with nested repeated data. I'll advise you to not work with nested repeated data until you are ready to accept that these subtleties can happen when working with nested repeated data.

这篇关于使用与重复字段同名的字段时,BigQuery 会变平的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆