Loading more records than actual in Hive


Problem description

While inserting from one Hive table into another, Hive appears to load more records than the source table actually contains. Can anyone help explain this odd behaviour?

My query looks like this:

insert overwrite table table_a
    select col1, col2, col3, ... from table_b;

My table_b contains 6405465 records.

After inserting from table_b into table_a, I found that table_a contains 6406565 records in total.

Can anyone please help here?

Recommended answer

If hive.compute.query.using.stats=true, the optimizer answers queries that can be served that way, such as simple counts, from the statistics stored in the metastore instead of reading the table data. This is much faster because the metastore is backed by a fast database such as MySQL and no map-reduce job is needed. However, the statistics can be stale if the table was loaded by some means other than INSERT OVERWRITE, or if hive.stats.autogather, the configuration parameter that controls automatic statistics gathering, was set to false. Statistics also go stale after files are loaded directly or written by third-party tools such as sqoop: the new files were never analyzed, so the metastore has no record of how the data changed. It is therefore good practice to gather statistics for the table or partition after each load by running 'ANALYZE TABLE ... COMPUTE STATISTICS'.
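
As a minimal sketch of that practice, assuming the target table is the table_a from the question and that it is not partitioned, the statistics could be refreshed and then inspected like this:

```sql
-- Recompute basic table statistics (row count, file count, sizes) by scanning the data.
ANALYZE TABLE table_a COMPUTE STATISTICS;

-- Inspect what the metastore now holds; numRows is listed under Table Parameters.
DESCRIBE FORMATTED table_a;
```

For a partitioned table the ANALYZE statement also takes a PARTITION clause; the question does not say whether table_a is partitioned, so that clause is left out here.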

If statistics cannot be gathered automatically (which works for INSERT OVERWRITE) or by running an ANALYZE statement, it is better to switch off the hive.compute.query.using.stats parameter. Hive will then query the data instead of using statistics.
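
A minimal sketch of that fallback, again assuming the table names from the question: disable the setting for the current session and recount, so that the results come from the data files rather than from the metastore.

```sql
-- Session-level override: answer queries from the data, not from metastore statistics.
SET hive.compute.query.using.stats=false;

-- Both counts now scan the tables, so they can be compared directly.
SELECT COUNT(*) FROM table_a;
SELECT COUNT(*) FROM table_b;
```

If the two counts now match, the earlier discrepancy most likely came from stale statistics rather than from extra rows being loaded.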

See this for reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-StatisticsinHive
