How exactly is the value of count(*) determined in BigQuery?


Question

I am joining a table of about 70000 rows with a slightly bigger second table using INNER JOIN EACH. Now count(a.business_column) and count(*) give different results. The former correctly reports ~70000, while the latter gives ~200000. But this only happens when I select count(*) alone; when I select both together, they give the same result (~70000). How is this possible?

select
   count(*)
   /*,count(a.business_column)*/

from table_a a
inner join each table_b b
   on b.key_column = a.business_column

Solution

UPDATE: For a step-by-step explanation of how this works, see the question "BigQuery flattens when using field with same name as repeated field" instead.


To answer the title question: COUNT(*) in BigQuery is always accurate.

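The caveat is that in SQL, COUNT(*) and COUNT(column) have semantically different meanings: COUNT(*) counts rows, while COUNT(column) counts only the rows where that column is not NULL. As a result, the sample query can be interpreted in different ways.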
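To make the difference concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in, not BigQuery, and the three sample rows are made up for illustration; only the table and column names are borrowed from the question. It demonstrates the general SQL rule only and does not reproduce the ~200000 vs ~70000 discrepancy from the question.

```python
import sqlite3

# In-memory SQLite database as a stand-in; this is NOT BigQuery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (business_column TEXT);
    -- Hypothetical sample data: two real values and one NULL.
    INSERT INTO table_a VALUES ('k1'), ('k2'), (NULL);
""")

# COUNT(*) counts every row; COUNT(business_column) skips the NULL row.
count_star, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(business_column) FROM table_a"
).fetchone()

print(count_star, count_col)  # prints: 3 2
```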

See: http://www.xaprb.com/blog/2009/04/08/the-dangerous-subtleties-of-left-join-and-count-in-sql/

There they have this sample query:

select user.userid, count(email.subject)
from user
   inner join email on user.userid = email.userid
group by user.userid;

That query turns out to be ambiguous, and the article's author changes it to a more explicit one, adding this comment:

But what if that’s not what the author of the query meant? There’s no way to really know. There are several possible intended meanings for the query, and there are several different ways to write the query to express those meanings more clearly. But the original query is ambiguous, for a few reasons. And everyone who reads this query afterwards will end up guessing what the original author meant. "I think I can safely change this to…"
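One way to see why such a query leaves readers guessing is to run a few variants side by side. The sketch below is an illustration only, again using sqlite3 rather than BigQuery; the tables users and emails and their rows are hypothetical, and the variants are not the rewrite proposed in the linked article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables and data, loosely modeled on the sample query.
    CREATE TABLE users (userid INTEGER);
    CREATE TABLE emails (userid INTEGER, subject TEXT);
    INSERT INTO users VALUES (1), (2);                   -- user 2 has no email
    INSERT INTO emails VALUES (1, 'hello'), (1, NULL);   -- one NULL subject
""")

query = """
    SELECT u.userid, COUNT(e.subject), COUNT(*)
    FROM users u
    {join_type} JOIN emails e ON u.userid = e.userid
    GROUP BY u.userid
    ORDER BY u.userid
"""

# INNER JOIN drops user 2 entirely; COUNT(e.subject) ignores the NULL subject.
inner_rows = conn.execute(query.format(join_type="INNER")).fetchall()

# LEFT JOIN keeps user 2 with one all-NULL email row:
# COUNT(e.subject) is 0 for that user, but COUNT(*) is 1.
left_rows = conn.execute(query.format(join_type="LEFT")).fetchall()

print(inner_rows)  # [(1, 1, 2)]
print(left_rows)   # [(1, 1, 2), (2, 0, 1)]
```

Each combination of join type and count expression answers a slightly different question (emails with a subject, emails of any kind, or rows produced by the join), which is why spelling out the intended one helps.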


