Use elephant-bird with hive to read protobuf data


Problem description

I have a problem similar to this one.

The following is what I used:

  1. CDH4.4 (hive 0.10)
  2. protobuf-java-2.4.1.jar
  3. elephant-bird-hive-4.6-SNAPSHOT.jar
  4. elephant-bird-core-4.6-SNAPSHOT.jar
  5. elephant-bird-hadoop-compat-4.6-SNAPSHOT.jar
  6. A jar file that includes the protoc-compiled .class files.

I followed the Protocol Buffers Java tutorial to create my data file, "testbook".

I then used hdfs dfs -mkdir /protobuf_data to create an HDFS folder, and hdfs dfs -put testbook /protobuf_data to put "testbook" into HDFS.

Then I followed the elephant-bird web page to create the table, with syntax like this:

create table addressbook
  row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
  with serdeproperties (
    "serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")
  stored as
    inputformat "com.twitter.elephantbird.mapred.input.DeprecatedRawMultiInputFormat"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
  LOCATION '/protobuf_data/';

All of these steps worked.

But when I submitted the query select * from addressbook; no results came back.

And I couldn't find any error logs to debug with.

Could someone help me?

Many thanks.

Solution

The problem has been solved.

At first I put the protobuf binary data directly into HDFS, and no results showed, because it doesn't work that way.

After I asked some senior colleagues, they said the protobuf binary data should be written into some kind of container file format, such as a Hadoop SequenceFile.

The elephant-bird page states this too, but at first I couldn't understand it completely.

After writing the protobuf binary data into a SequenceFile, I could read the protobuf data with Hive.

Because I use the SequenceFile format, I use the following in the create table syntax:

inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
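
Putting the pieces together, the full statement might look like the sketch below. It is simply the create table statement from the question with the two format classes swapped for the ones above. Two things here are assumptions not stated in the post: that the LOCATION is unchanged, and that each SequenceFile value is a BytesWritable holding one serialized AddressBook message.

```sql
-- Sketch only: same serde and serialization.class as in the question,
-- with the SequenceFile input/output formats from the answer.
-- Assumption: each SequenceFile value is a BytesWritable containing
-- one serialized AddressBook message.
-- Assumption: the data still lives under /protobuf_data/.
create table addressbook
  row format serde "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
  with serdeproperties (
    "serialization.class"="com.example.tutorial.AddressBookProtos$AddressBook")
  stored as
    inputformat 'org.apache.hadoop.mapred.SequenceFileInputFormat'
    outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
  LOCATION '/protobuf_data/';
```

If select * from addressbook; still returns nothing with a table like this, it may be worth checking how the SequenceFile was written before suspecting the table definition.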

I hope this helps others who are new to Hadoop, Hive, and elephant-bird too.
