Hive - Queries on Partitions return nothing


Problem description

I have a table that is partitioned by start date (ds). I can query the latest partition (the previous day's data), and the query uses the partition as expected.

hive> select count(1) from vtc4 where ds='2012-11-01' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 6.43 sec   HDFS Read: 46281957 HDFS Write:  7 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 430 msec
OK
151225
Time taken: 35.007 seconds

However, when I try to query earlier partitions, hive seems to read the partition fine, but does not return any results.

hive> select count(1) from vtc4 where ds='2012-10-31' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 7.64 sec   HDFS Read: 37754168 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 640 msec
OK
0
Time taken: 29.07 seconds

However, if I tell hive to run the query against the date field inside the table itself, and don't use the partition, I get the correct result.

hive> select count(1) from vtc4 where date_started >= "2012-10-31 00:00:00" and date_started < "2012-11-01 00:00:00" ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 63  Reduce: 1   Cumulative CPU: 453.52 sec   HDFS Read: 16420276606 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 minutes 33 seconds 520 msec
OK
123201
Time taken: 265.874 seconds

What am I missing here? I'm running hadoop 1.03 and hive 0.9. I'm pretty new to hive/hadoop, so any help would be appreciated.

Thanks.

EDIT 1:

hive> describe formatted vtc4 partition (ds='2012-10-31');

Partition Value:        [2012-10-31 ]
Database:               default
Table:                  vtc4
CreateTime:             Wed Oct 31 12:02:24 PDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Location:               hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31
Partition Parameters:
    transient_lastDdlTime   1351875579

# Storage Information 
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
    serialization.format    1
Time taken: 0.191 seconds

The partition folders exist, but when I try to do a hadoop fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31, it says the file/directory does not exist. If I browse to that directory using the web interface, I can get into the folder and see the /part-m-000* files. If I do a fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-11-01, it works fine.
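One way to compare what HDFS actually lists against the Location shown above is to list the parent directory (hadoop fs -ls on .../vtc4) and inspect each name character by character. The following is a minimal sketch of that check, not a diagnosis of this particular problem. The helper names paths_from_ls and odd_paths are hypothetical, and the parsing assumes the listing has eight whitespace-separated columns (permissions, replication, owner, group, size, date, time, path), which may differ between Hadoop versions.

```python
def paths_from_ls(ls_output):
    """Extract the path column from `hadoop fs -ls` text output.

    Assumes eight whitespace-separated columns per entry; header lines
    such as "Found 2 items" have fewer fields and are skipped.
    """
    paths = []
    for line in ls_output.splitlines():
        # maxsplit=7 keeps the path intact, including any trailing whitespace.
        fields = line.split(None, 7)
        if len(fields) == 8:
            paths.append(fields[7])
    return paths


def odd_paths(paths):
    """Return paths with leading/trailing whitespace or non-printable characters."""
    return [p for p in paths if p != p.strip() or not p.isprintable()]


if __name__ == "__main__":
    import sys

    # Usage (assumed): hadoop fs -ls /user/hive/warehouse/vtc4 | python check_names.py
    for path in paths_from_ls(sys.stdin.read()):
        # repr() makes otherwise invisible characters visible.
        flag = "  <-- check this name" if odd_paths([path]) else ""
        print(repr(path) + flag)
```

If every name prints cleanly and matches the Location exactly, the directory names are not the issue and permissions or metadata remain the more likely suspects.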

Solution

This looks like either a permissions problem or something off in Hive's or the NameNode's metadata. Here's what I would try:

  1. Copy the data in that partition to some other location in HDFS. You may need to do this as the hive or hdfs user, depending on how your permissions are set up.
  2. alter table vtc4 drop partition (ds='2012-10-31');
  3. alter table vtc4 add partition (ds='2012-10-31');
  4. Copy the data back into that partition on HDFS.
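
The four steps above could be scripted roughly as follows. This is a minimal sketch, not a tested recovery procedure: it only builds and prints the commands so they can be reviewed before anything is run. The table name, partition value and warehouse path come from the question; the backup location /tmp/vtc4_backup and the helper name rebuild_partition_commands are hypothetical. It assumes vtc4 is a managed (non-external) table, in which case dropping the partition also removes its directory, which is why the copy comes first.

```python
import shlex


def rebuild_partition_commands(table, part_col, part_value, warehouse_dir, backup_dir):
    """Return the argv lists for the four steps, in order. Nothing is executed."""
    part_dir = f"{warehouse_dir}/{table}/{part_col}={part_value}"
    spec = f"{part_col}='{part_value}'"
    return [
        # 1. Copy the partition directory out of the warehouse.
        ["hadoop", "fs", "-cp", part_dir, backup_dir],
        # 2. Drop the partition (for a managed table this also deletes its data).
        ["hive", "-e", f"ALTER TABLE {table} DROP PARTITION ({spec});"],
        # 3. Re-create the partition so the metastore entry is fresh.
        ["hive", "-e", f"ALTER TABLE {table} ADD PARTITION ({spec});"],
        # 4. Copy the files back into the partition directory.
        ["hadoop", "fs", "-cp", f"{backup_dir}/*", part_dir],
    ]


if __name__ == "__main__":
    commands = rebuild_partition_commands(
        table="vtc4",
        part_col="ds",
        part_value="2012-10-31",
        warehouse_dir="/user/hive/warehouse",
        backup_dir="/tmp/vtc4_backup",  # hypothetical backup location
    )
    for argv in commands:
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(argv))
```

Printing rather than executing is deliberate: since hadoop fs -ls already fails on this directory, step 1 may fail as well, and it is safer to confirm the backup exists before running the drop in step 2.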
