Reading avro format data in hadoop/map reduce


Problem description

I am trying to read Avro-format data in Hadoop that is saved in HDFS. But most of the examples I have seen require us to pass a schema to the job, and I am not able to understand that requirement. I use Pig with Avro, and I have never passed schema information.

So, I think I might be missing something. Basically, what's a good way to read Avro files in Hadoop MapReduce if I don't have schema information?
Thanks

Recommended answer

You're right, Avro is pretty strict about knowing the type in advance. The only option I know of, if you have no idea of the schema, is to read the data as a GenericRecord. Here's a snippet of how to do that:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<AvroKey<GenericRecord>, NullWritable, ... > {
    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context) throws IOException, InterruptedException {
        GenericRecord datum = key.datum();
        Schema schema = datum.getSchema();          // the schema travels with the data
        Object field1 = datum.get(0);               // look up a field by position...
        Object someField = datum.get("someField");  // ...or by name
        ...
    }
}

You won't have the nice getters and setters, of course, since Java doesn't know what type it is. The only getters available retrieve fields by either position or name. You'll have to cast the result to the type you know the field to be. If you don't know, you'll have to do instanceof checks for every possibility, since Java is statically compiled (this is also why having access to the schema is less helpful than it might at first appear).
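To illustrate, here is a minimal, self-contained sketch of that instanceof dispatch. The describe helper is hypothetical, and a plain Object stands in for a value returned by datum.get(...); note that Avro usually hands string fields back as org.apache.avro.util.Utf8 rather than String, so checking against CharSequence covers both:

```java
public class FieldTypeCheck {

    // Hypothetical helper: classify a field value pulled out of a GenericRecord.
    // Avro maps its primitive types to Java classes like these at runtime.
    static String describe(Object field) {
        if (field == null) return "null";                    // nullable fields come back as null
        if (field instanceof Integer) return "int";
        if (field instanceof Long) return "long";
        if (field instanceof Double) return "double";
        if (field instanceof CharSequence) return "string";  // covers String and Avro's Utf8
        return "unknown: " + field.getClass().getName();
    }

    public static void main(String[] args) {
        Object f1 = 42;       // what datum.get(...) might return for an int field
        Object f2 = "hello";  // stand-in for a string field
        System.out.println(describe(f1)); // int
        System.out.println(describe(f2)); // string
    }
}
```

In a real mapper you would run each value from datum.get(...) through checks like these before casting it.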

But if you know the type it could be (or should be), you can call getSchema() on the class generated from the avsc file you expect your input to match, create a new instance of it, then map the fields one by one from the GenericRecord onto that new object. This gives you back the normal Avro methods. It gets more complicated, of course, when dealing with unions, nulls, and schema versioning.
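As a sketch of that field-by-field copy: the User class below is a hand-written stand-in for a class generated from an avsc file (a real generated class would also carry getSchema(), builders, and so on), and a plain Map plays the role of the GenericRecord so the example stays self-contained:

```java
import java.util.HashMap;
import java.util.Map;

public class GenericToSpecific {

    // Hand-written stand-in for a class generated from an avsc schema.
    static class User {
        CharSequence name;
        int age;
    }

    // Copy fields one by one, casting each to the type the schema says it should be.
    // A real version would take an org.apache.avro.generic.GenericRecord instead of a Map.
    static User fromGeneric(Map<String, Object> datum) {
        User u = new User();
        u.name = (CharSequence) datum.get("name"); // Avro strings are CharSequence (often Utf8)
        u.age = (Integer) datum.get("age");        // throws ClassCastException if the field isn't an int
        return u;
    }

    public static void main(String[] args) {
        Map<String, Object> datum = new HashMap<>();
        datum.put("name", "Ada");
        datum.put("age", 36);
        User u = fromGeneric(datum);
        System.out.println(u.name + " " + u.age); // Ada 36
    }
}
```

Once the fields are on the specific object, the rest of the job can treat it like any other generated Avro record.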

