Hive: select count(*) with a non-null filter returns a higher value than select count(*)
Question
I am currently doing some data exploration with Hive and cannot explain the following behavior. Say I have a table (named mytable) with a field master_id.
When I count the rows, I get:
select count(*) as c from mytable
c
1129563
If I count the rows with a non-null master_id, I get a higher number:
select count(*) as c from mytable where master_id is not null
c
1134041
Additionally, master_id never seems to be null:
select count(*) as c from mytable where master_id is null
c
0
I cannot explain how adding a WHERE clause can increase the row count. Does anyone have a hint that explains this behavior?
Thanks
Recommended answer
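Most probably your query without the WHERE clause is being answered from table statistics rather than by scanning the data, because this parameter is set. If the stored statistics are stale, the unfiltered count can be lower than the real row count, while the filtered query, which does read the data, returns the correct number: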
set hive.compute.query.using.stats=true;
Try setting it to false and running the query again.
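As a minimal sketch of that check, reusing the table name from the question: with the parameter turned off for the current session, the unfiltered count should be computed from the data itself. If stale statistics are indeed the cause, the result would be expected to match the filtered count.

```sql
-- Stop Hive from answering simple aggregates from metastore statistics
-- (session-level setting only)
set hive.compute.query.using.stats=false;

-- Same query as in the question; it should now scan the table
select count(*) as c from mytable;
```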
Alternatively, you can recompute the statistics on the table so that they match the data. See the ANALYZE TABLE syntax.
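A short sketch of recomputing the statistics, assuming mytable is not partitioned. The partition column dt in the commented variant is hypothetical and only illustrates the form used for partitioned tables.

```sql
-- Recompute basic table statistics (row count, size, number of files)
ANALYZE TABLE mytable COMPUTE STATISTICS;

-- For a partitioned table, name the partition column(s) instead.
-- "dt" is a hypothetical column, not taken from the question:
-- ANALYZE TABLE mytable PARTITION(dt) COMPUTE STATISTICS;

-- Inspect the stored statistics; numRows appears under Table Parameters
DESCRIBE FORMATTED mytable;
```

After this, a count(*) answered from statistics should agree with one computed from the data, as long as the table is not modified outside Hive afterwards.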
Statistics can also be gathered automatically during INSERT OVERWRITE:
set hive.stats.autogather=true;