Difference in behaviour when running "count(*)" on Tez and MapReduce


Problem description

I recently ran into the following issue. I had data files under an HDFS (Hadoop Distributed File System) path and a related Hive table. Both HDFS and the table had 30 partitions.

I deleted 5 partitions from HDFS and then executed "msck repair table <db.tablename>;" on the Hive table. It completed fine but output:

"Partitions missing from filesystem:"

I then tried running "select count(*) from <db.tablename>;" on Tez, and it failed with the following error:

Caused by: java.util.concurrent.ExecutionException: java.io.FileNotFoundException:

But when I set hive.execution.engine to "mr" and executed "select count(*) from <db.tablename>;", it ran fine without any issue.

I now have two questions:

  1. How is this possible?

  2. How can I sync the Hive metastore with the HDFS partitions in the above case? (My Hive version is "Hive 1.2.1000.2.6.5.0-292".)

Thanks in advance for your help.

Recommended answer

MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

This updates the Hive metastore with partition metadata for partitions that do not already have it. The default option for the MSCK command is ADD PARTITIONS: it adds to the metastore any partitions that exist on HDFS but not in the metastore. The DROP PARTITIONS option removes from the metastore the information for partitions that have already been removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS.
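The three options can be pictured as set operations on partition names. The following is only a conceptual sketch in Python, not how Hive implements MSCK; the function name `msck_repair` and the `dt=NN` partition names are made up for illustration.

```python
def msck_repair(metastore, hdfs, mode="ADD"):
    """Return the metastore partition set after a simulated MSCK REPAIR.

    Conceptual model only: `metastore` and `hdfs` are sets of partition
    names, and `mode` mirrors the ADD / DROP / SYNC PARTITIONS options.
    """
    metastore, hdfs = set(metastore), set(hdfs)
    if mode == "ADD":
        # Add partitions found on HDFS; stale entries are left in place.
        return metastore | hdfs
    if mode == "DROP":
        # Remove entries whose directories no longer exist on HDFS.
        return metastore & hdfs
    if mode == "SYNC":
        # Equivalent to ADD followed by DROP.
        return msck_repair(msck_repair(metastore, hdfs, "ADD"), hdfs, "DROP")
    raise ValueError(f"unknown mode: {mode}")


# The situation from the question: 30 partitions registered, 5 deleted on HDFS.
registered = {f"dt={d:02d}" for d in range(1, 31)}
on_hdfs = {f"dt={d:02d}" for d in range(6, 31)}

print(len(msck_repair(registered, on_hdfs, "ADD")))   # stale entries remain
print(len(msck_repair(registered, on_hdfs, "DROP")))  # stale entries removed
```

In this model the default ADD mode leaves the 5 stale entries in the metastore, which matches the "Partitions missing from filesystem" message reported in the question.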

However, these options are available only from Hive 3.0 onward. See HIVE-17824.

In your case the version is Hive 1.2, so below are the steps to sync the HDFS partitions with the table partitions in the metastore.

  1. Drop the 5 partitions that you removed directly from HDFS, using the ALTER statement below.

ALTER TABLE <db.table_name> DROP PARTITION (<partition_column=value>);

  2. Run SHOW PARTITIONS <table_name>; and check that the list of partitions has been refreshed.

This should bring the partitions in the Hive Metastore (HMS) in line with those in HDFS.
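When several partitions are affected, the DROP statements from the steps above can be generated rather than typed by hand. The sketch below is one possible approach in Python, under these assumptions: partition specs look like the output of SHOW PARTITIONS (for example `dt=2020-01-01` or `year=2020/month=01`), values contain no quotes or escaped characters, and quoting every value as a string is acceptable for the table's partition column types. The helper name `drop_statements` and the table name `mydb.events` are hypothetical.

```python
def drop_statements(table, metastore_parts, hdfs_parts):
    """Build ALTER TABLE ... DROP PARTITION statements for stale partitions.

    A partition is treated as stale when it is listed in the metastore
    (e.g. via SHOW PARTITIONS) but has no matching directory on HDFS.
    """
    stale = sorted(set(metastore_parts) - set(hdfs_parts))
    statements = []
    for spec in stale:
        # "year=2020/month=01" -> [("year", "2020"), ("month", "01")]
        pairs = [part.split("=", 1) for part in spec.split("/")]
        clause = ", ".join(f"{key}='{value}'" for key, value in pairs)
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({clause});"
        )
    return statements


# Hypothetical inputs: one partition still on HDFS, one deleted by hand.
in_metastore = ["dt=2020-01-01", "dt=2020-01-02"]
on_hdfs = ["dt=2020-01-02"]

for stmt in drop_statements("mydb.events", in_metastore, on_hdfs):
    print(stmt)
```

Review the generated statements before running them in Hive, since the script only compares names and does not inspect the table itself.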

Alternatively, if it is an EXTERNAL table, you can drop and recreate the table and then run MSCK REPAIR on the newly created table. This works because dropping an external table does not delete the underlying data.

Note: By default, MSCK REPAIR only adds partitions newly added in HDFS to the Hive Metastore; it does not remove from the Hive Metastore partitions that were deleted manually in HDFS.

====

To avoid these steps in future, drop partitions from Hive directly, using ALTER TABLE <table_name> DROP PARTITION (<partition_column=value>), rather than deleting them in HDFS.
