Handling Writables fully qualified name changes in Hadoop SequenceFile


Problem description

I have a bunch of Hadoop SequenceFiles that have been written with some Writable subclass I wrote. Let's call it FishWritable.

This Writable worked out well for a while, until I decided there was a need for a package renaming for clarity. So now the fully qualified name of FishWritable is com.vertebrates.fishes.FishWritable instead of com.mammals.fishes.FishWritable. It was a reasonable change given how the scope of the package in question had evolved.

Then I discovered that none of my MapReduce jobs would run, as they crash when attempting to initialize the SequenceFileRecordReader:

java.lang.RuntimeException: java.io.IOException: WritableName can't load class: com.mammals.fishes.FishWritable
at org.apache.hadoop.io.SequenceFile$Reader.getKeyClass(SequenceFile.java:1949)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1899)
...

A couple of options for dealing with this are immediately apparent. I can simply rerun all my previous jobs to regenerate the output with the up-to-date key class name, running any dependent jobs in sequence. This can obviously be quite time-consuming and is sometimes not even possible.

Another possibility might be to write a simple job that reads the SequenceFile as text and replaces any instances of the class name with the new one. This is basically method #1 with a tweak that makes it less complicated to do. If I have a lot of big files, it's still quite impractical.

Is there a better way to deal with refactorings of fully qualified class names used in SequenceFiles? Ideally, I'm looking for some way to specify a new fallback class name if the specified one is not found, to allow for running against both dated and updated types of this SequenceFile.

Solution

The org.apache.hadoop.io.WritableName class mentioned in the exception stack trace has some useful methods.

From the doc:

Utility to permit renaming of Writable implementation classes without invalidating files that contain their class name.

// Add an alternate name for a class.
public static void addName(Class writableClass, String name)

In your case you could call this before reading from your SequenceFiles:

WritableName.addName(com.vertebrates.fishes.FishWritable.class, "com.mammals.fishes.FishWritable");

This way, when attempting to read a com.mammals.fishes.FishWritable from an old SequenceFile, the new com.vertebrates.fishes.FishWritable class will be used.
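To see why the call has to happen before the reader is opened, here is a minimal, Hadoop-free model of an alias-aware class lookup. This is only a sketch under my own assumptions, not Hadoop's actual source: AliasRegistry and resolve are hypothetical names, and StringBuilder merely stands in for the renamed FishWritable so the example runs without any Hadoop jars.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified stand-in for an alias-aware class lookup; NOT Hadoop's actual code. */
public class AliasRegistry {
    // Maps an alternate (for example, old) class name to the class to use for it.
    private static final Map<String, Class<?>> NAME_TO_CLASS = new ConcurrentHashMap<>();

    /** Registers an alternate name for a class, in the spirit of WritableName.addName. */
    public static void addName(Class<?> cls, String name) {
        NAME_TO_CLASS.put(name, cls);
    }

    /** Resolves a name: registered aliases first, then the normal class loader. */
    public static Class<?> resolve(String name) throws IOException {
        Class<?> cls = NAME_TO_CLASS.get(name);
        if (cls != null) {
            return cls;
        }
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new IOException("can't load class: " + name, e);
        }
    }

    public static void main(String[] args) throws IOException {
        String oldName = "com.mammals.fishes.FishWritable";
        try {
            // No alias yet and no such class on the classpath, so this fails.
            resolve(oldName);
        } catch (IOException e) {
            System.out.println("before addName: " + e.getMessage());
        }
        // StringBuilder is a placeholder for the renamed class.
        addName(StringBuilder.class, oldName);
        System.out.println("after addName: " + resolve(oldName).getName());
    }
}
```

The alias only exists in the JVM where addName was called. As far as I know the real WritableName registry is static in the same way, so in a MapReduce job the call probably needs to run in the task JVMs as well (for instance from a static initializer in a class that is loaded there before the reader is created), not just in the driver.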

PS: Why was the fish in the mammals package in the first place? ;)
