Cannot query example AddressBook protobuf data in Hive with elephant-bird


Problem Description


I'm trying to use elephant-bird to query some example protobuf data. I'm using the AddressBook example: I serialized a handful of fake AddressBooks into files and put them in HDFS under /user/foo/data/elephant-bird/addressbooks/. The query returns no results.
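For context, here is a minimal sketch of how such a file could have been produced, assuming the generated classes mirror the standard protobuf AddressBook tutorial; the names, values, and output path are illustrative placeholders, not the exact code used:

import java.io.FileOutputStream;

import com.twitter.data.proto.tutorial.AddressBookProtos.AddressBook;
import com.twitter.data.proto.tutorial.AddressBookProtos.Person;

public class WriteRawAddressBook {
  public static void main(String[] args) throws Exception {
    // Build one fake AddressBook containing a single Person.
    AddressBook book = AddressBook.newBuilder()
        .addPerson(Person.newBuilder()
            .setName("Jane Doe")
            .setId(1)
            .setEmail("jane@example.com")
            .addPhone(Person.PhoneNumber.newBuilder()
                .setNumber("555-0100")
                .setType(Person.PhoneType.MOBILE)))
        .build();

    // writeTo() emits only the raw serialized bytes of the message:
    // no length prefix, no block markers, no compression.
    FileOutputStream out = new FileOutputStream("addressbook-raw.bin");
    book.writeTo(out);
    out.close();
  }
}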

I set up the table and query like so:

add jar /home/foo/downloads/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar;
add jar /home/foo/downloads/elephant-bird/core/target/elephant-bird-core-4.6-SNAPSHOT.jar;
add jar /home/foo/downloads/elephant-bird/hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar;

create external table addresses
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook")
STORED AS
-- elephant-bird provides an input format for use with hive
INPUTFORMAT "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
-- placeholder as we will not be writing to this table
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/user/foo/data/elephant-bird/addressbooks/';


describe formatted addresses;

OK
# col_name              data_type               comment

person                  array<struct<name:string,id:int,email:string,phone:array<struct<number:string,type:string>>>>  from deserializer
byteData                binary                  from deserializer

# Detailed Table Information
Database:               default
Owner:                  foo
CreateTime:             Tue Oct 28 13:49:53 PDT 2014
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://foo:8020/user/foo/data/elephant-bird/addressbooks
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        transient_lastDdlTime   1414529393

# Storage Information
SerDe Library:          com.twitter.elephantbird.hive.serde.ProtobufDeserializer
InputFormat:            com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.class     com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook
        serialization.format    1
Time taken: 0.421 seconds, Fetched: 29 row(s)

When I try to select data, it returns no results (doesn't appear to read rows):

select count(*) from addresses;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_1413311929339_0061, Tracking URL = http://foo:8088/proxy/application_1413311929339_0061/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1413311929339_0061
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2014-10-28 13:50:37,674 Stage-1 map = 0%,  reduce = 0%
2014-10-28 13:50:51,055 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU 2.14 sec
2014-10-28 13:50:52,152 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU 2.14 sec
MapReduce Total cumulative CPU time: 2 seconds 140 msec
Ended Job = job_1413311929339_0061
MapReduce Jobs Launched:
Job 0: Reduce: 1   Cumulative CPU: 2.14 sec   HDFS Read: 0 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 140 msec
OK
0
Time taken: 37.519 seconds, Fetched: 1 row(s)

I see the same thing if I create a non-external table or if I explicitly import data into the external table.

Version info for my setup:

Thrift 0.7
protobuf: libprotoc 2.5.0
hadoop:
Hadoop 2.5.0-cdh5.2.0
Subversion http://github.com/cloudera/hadoop -r e1f20a08bde76a33b79df026d00a0c91b2298387
Compiled by jenkins on 2014-10-11T21:00Z
Compiled with protoc 2.5.0
From source with checksum 309bccd135b199bdfdd6df5f3f4153d

UPDATE:

I see this error in the logs. My data in HDFS is just raw protobuf (no compression). I'd like to figure out if that's the issue, and if I can read raw binary protobuf.


    Error: java.io.IOException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:346)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:293)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:407)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:560)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:168)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:332)
    ... 11 more
    Caused by: java.io.IOException: No codec for file hdfs://foo:8020/user/foo/data/elephantbird/addressbooks/1000AddressBooks-1684394246.bin found
    at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.determineFileFormat(MultiInputFormat.java:176)
    at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.createRecordReader(MultiInputFormat.java:88)
    at com.twitter.elephantbird.mapreduce.input.RawMultiInputFormat.createRecordReader(RawMultiInputFormat.java:36)
    at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.<init>(DeprecatedInputFormatWrapper.java:256)
    at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getRecordReader(DeprecatedInputFormatWrapper.java:121)
    at com.twitter.elephantbird.mapred.input.DeprecatedFileInputFormatWrapper.getRecordReader(DeprecatedFileInputFormatWrapper.java:55)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
    ... 16 more

Solution

Have you solved the problem?

I had the same problem just as you described.

Yes you are right, I found out that raw binary protobuf can't be read directly.

This is the question I had asked: Use elephant-bird with hive to read protobuf data
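For reference, a minimal sketch of that workaround, assuming elephant-bird's ProtobufBlockWriter (from elephant-bird-core) is used to re-write the messages in the block container format that DeprecatedRawMultiInputFormat can detect; the file name and record count below are placeholders:

import java.io.FileOutputStream;

import com.twitter.data.proto.tutorial.AddressBookProtos.AddressBook;
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter;

public class WriteBlockAddressBooks {
  public static void main(String[] args) throws Exception {
    // Wrap each AddressBook in elephant-bird's block format instead of
    // writing the raw message bytes, so MultiInputFormat can recognize it.
    FileOutputStream out = new FileOutputStream("addressbooks-block.bin");
    ProtobufBlockWriter<AddressBook> writer =
        new ProtobufBlockWriter<AddressBook>(out, AddressBook.class);

    AddressBook book = AddressBook.newBuilder().build();  // placeholder message
    for (int i = 0; i < 1000; i++) {
      writer.write(book);
    }

    writer.finish();  // flush the final block
    writer.close();
  }
}

After copying the resulting file into the table's LOCATION (for example with hdfs dfs -put), the same select count(*) query should see the rows.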

Hope it helps

Best regards
