Cannot query example AddressBook protobuf data in Hive with elephant-bird


Problem Description


I'm trying to use elephant-bird to query some example protobuf data. I'm using the AddressBook example: I serialized a handful of fake AddressBooks into files and put them in HDFS under /user/foo/data/elephant-bird/addressbooks/. The query returns no results.
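For context, here is a minimal sketch of how such a file could have been produced, assuming the generated classes mirror the standard protobuf AddressBook tutorial; the names, values, and output path are illustrative placeholders, not the exact code used:

import java.io.FileOutputStream;

import com.twitter.data.proto.tutorial.AddressBookProtos.AddressBook;
import com.twitter.data.proto.tutorial.AddressBookProtos.Person;

public class WriteRawAddressBook {
  public static void main(String[] args) throws Exception {
    // Build one fake AddressBook containing a single Person.
    AddressBook book = AddressBook.newBuilder()
        .addPerson(Person.newBuilder()
            .setName("Jane Doe")
            .setId(1)
            .setEmail("jane@example.com")
            .addPhone(Person.PhoneNumber.newBuilder()
                .setNumber("555-0100")
                .setType(Person.PhoneType.MOBILE)))
        .build();

    // writeTo() emits only the raw serialized bytes of the message:
    // no length prefix, no block markers, no compression.
    FileOutputStream out = new FileOutputStream("addressbook-raw.bin");
    book.writeTo(out);
    out.close();
  }
}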

I set up the table and query like so:

add jar /home/foo/downloads/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar;
add jar /home/foo/downloads/elephant-bird/core/target/elephant-bird-core-4.6-SNAPSHOT.jar;
add jar /home/foo/downloads/elephant-bird/hive/target/elephant-bird-hive-4.6-SNAPSHOT.jar;

create external table addresses
row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
with serdeproperties (
"serialization.class"="com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook")
STORED AS
-- elephant-bird provides an input format for use with hive
INPUTFORMAT "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
-- placeholder as we will not be writing to this table
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION '/user/foo/data/elephant-bird/addressbooks/';


describe formatted addresses;

OK
# col_name              data_type               comment

person                  array<struct<name:string,id:int,email:string,phone:array<struct<number:string,type:string>>>>  from deserializer
byteData                binary                  from deserializer

# Detailed Table Information
Database:               default
Owner:                  foo
CreateTime:             Tue Oct 28 13:49:53 PDT 2014
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://foo:8020/user/foo/data/elephant-bird/addressbooks
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        transient_lastDdlTime   1414529393

# Storage Information
SerDe Library:          com.twitter.elephantbird.hive.serde.ProtobufDeserializer
InputFormat:            com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.class     com.twitter.data.proto.tutorial.AddressBookProtos$AddressBook
        serialization.format    1
Time taken: 0.421 seconds, Fetched: 29 row(s)

When I try to select data, it returns no results (doesn't appear to read rows):

select count(*) from addresses;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapred.reduce.tasks=
Starting Job = job_1413311929339_0061, Tracking URL = http://foo:8088/proxy/application_1413311929339_0061/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1413311929339_0061
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 1
2014-10-28 13:50:37,674 Stage-1 map = 0%,  reduce = 0%
2014-10-28 13:50:51,055 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU 2.14 sec
2014-10-28 13:50:52,152 Stage-1 map = 0%,  reduce = 100%, Cumulative CPU 2.14 sec
MapReduce Total cumulative CPU time: 2 seconds 140 msec
Ended Job = job_1413311929339_0061
MapReduce Jobs Launched:
Job 0: Reduce: 1   Cumulative CPU: 2.14 sec   HDFS Read: 0 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 140 msec
OK
0
Time taken: 37.519 seconds, Fetched: 1 row(s)

I see the same thing if I create a non-external table or if I explicitly import data into the external table.

Version info for my setup:

Thrift 0.7
protobuf: libprotoc 2.5.0
hadoop:
Hadoop 2.5.0-cdh5.2.0
Subversion http://github.com/cloudera/hadoop -r e1f20a08bde76a33b79df026d00a0c91b2298387
Compiled by jenkins on 2014-10-11T21:00Z
Compiled with protoc 2.5.0
From source with checksum 309bccd135b199bdfdd6df5f3f4153d

UPDATE:

I see this error in the logs. My data in HDFS is just raw protobuf (no compression). I'd like to figure out if that's the issue, and if I can read raw binary protobuf.


    Error: java.io.IOException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:346)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:293)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:407)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:560)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:168)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:332)
    ... 11 more
    Caused by: java.io.IOException: No codec for file hdfs://foo:8020/user/foo/data/elephantbird/addressbooks/1000AddressBooks-1684394246.bin found
    at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.determineFileFormat(MultiInputFormat.java:176)
    at com.twitter.elephantbird.mapreduce.input.MultiInputFormat.createRecordReader(MultiInputFormat.java:88)
    at com.twitter.elephantbird.mapreduce.input.RawMultiInputFormat.createRecordReader(RawMultiInputFormat.java:36)
    at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper$RecordReaderWrapper.<init>(DeprecatedInputFormatWrapper.java:256)
    at com.twitter.elephantbird.mapred.input.DeprecatedInputFormatWrapper.getRecordReader(DeprecatedInputFormatWrapper.java:121)
    at com.twitter.elephantbird.mapred.input.DeprecatedFileInputFormatWrapper.getRecordReader(DeprecatedFileInputFormatWrapper.java:55)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
    ... 16 more

Solution

Have you solved the problem?

I had the same problem just as you described.

Yes you are right, I found out that raw binary protobuf can't be read directly.

This is the question I had asked: Use elephant-bird with hive to read protobuf data
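For reference, a minimal sketch of that workaround, assuming elephant-bird's ProtobufBlockWriter (from elephant-bird-core) is used to re-write the messages in the block container format that DeprecatedRawMultiInputFormat can detect; the file name and record count below are placeholders:

import java.io.FileOutputStream;

import com.twitter.data.proto.tutorial.AddressBookProtos.AddressBook;
import com.twitter.elephantbird.mapreduce.io.ProtobufBlockWriter;

public class WriteBlockAddressBooks {
  public static void main(String[] args) throws Exception {
    // Wrap each AddressBook in elephant-bird's block format instead of
    // writing the raw message bytes, so MultiInputFormat can recognize it.
    FileOutputStream out = new FileOutputStream("addressbooks-block.bin");
    ProtobufBlockWriter<AddressBook> writer =
        new ProtobufBlockWriter<AddressBook>(out, AddressBook.class);

    AddressBook book = AddressBook.newBuilder().build();  // placeholder message
    for (int i = 0; i < 1000; i++) {
      writer.write(book);
    }

    writer.finish();  // flush the final block
    writer.close();
  }
}

After copying the resulting file into the table's LOCATION (for example with hdfs dfs -put), the same select count(*) query should see the rows.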

Hope it helps

Best regards
