HIVE选择计数(*)非空返回的值高于选择计数(*) [英] HIVE select count(*) non null returns higher value than select count(*)

查看:873
本文介绍了HIVE选择计数(*)非空返回的值高于选择计数(*)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在使用Hive进行一些数据探索,无法解释以下行为。假设我有一个带有字段master_id的表(名为mytable)。



当我计算我得到的行数时

  select count(*)as c from mytable 
c
1129563

如果我要计算具有非null master_id的行数,我会得到一个更高的数字

  select count(*)as mytable from mytable其中master_id不为空
c
1134041





  select count(*)as mytable from mytable其中master_id是空的
c
0

我无法解释如何添加where语句可以增加最终的行数。有没有人有任何解释这种行为的提示?



谢谢 >很可能你的查询没有使用统计数据,因为这个参数被设置:

  set hive.compute.query.using。统计= TRUE; 

尝试将其设置为false并再次执行。

或者,您可以在表格上计算统计信息。
请参阅 ANALYZE TABLE SYNTAX
a>



也可以在INSERT OVERWRITE自动收集统计信息:

  set hive.stats.autogather = true; 


I am currently doing some data exploration with Hive and cannot explain the following behavior. Say I have a table (named mytable) with a field master_id.

When I count the number of row I get

select count(*) as c from mytable 
c
1129563

If I want to count the number of row with a non null master_id, I get a higher number

select count(*) as c from mytable where master_id is not null
c
1134041

Additionally, the master_id seems to be never null.

select count(*) as c from mytable where master_id is null
c
0

I cannot explain how adding a where statement can increase the number of rows eventually. Does anyone have any hint to explain this behavior ?

Thanks

解决方案

Most probably your query without where is using statistics because of this parameter is set:

set hive.compute.query.using.stats=true;

Try to set it false and execute again.

Alternatively you can compute statistics on the table. See ANALYZE TABLE SYNTAX

Also it's possible to gather statistics during INSERT OVERWRITE automatically:

set hive.stats.autogather=true;

这篇关于HIVE选择计数(*)非空返回的值高于选择计数(*)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆