Propagating custom configuration values in Hadoop

Problem Description

Is there any way to set and (later) get a custom configuration object in Hadoop, during Map/Reduce?

For example, assume an application that preprocesses a large file and dynamically determines some characteristics related to it. Furthermore, assume that those characteristics are saved in a custom Java object (e.g., a Properties object, but not exclusively, since some may not be strings) and are subsequently needed by each of the map and reduce tasks.

How could the application "propagate" this configuration, so that each mapper and reducer function can access it, when needed?

One approach could be to use the set(String, String) method of the JobConf class and, for instance, pass the configuration object serialized as a JSON string via the second parameter, but this may be too much of a hack, and the appropriate JobConf instance would have to be accessed by each Mapper and Reducer anyway (e.g., following an approach like the one suggested in an earlier question).

Solution

Unless I'm missing something, if you have a Properties object containing every property you need in your M/R job, you simply need to write the content of the Properties object to the Hadoop Configuration object. For example, something like this:

import java.util.Map.Entry;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
Properties params = getParameters(); // do whatever you need here to create your object
for (Entry<Object, Object> entry : params.entrySet()) {
    String propName = (String) entry.getKey();
    String propValue = (String) entry.getValue();
    conf.set(propName, propValue); // each property becomes a plain Configuration entry
}
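For completeness, here is a minimal sketch of how that populated Configuration then reaches the tasks, assuming the newer org.apache.hadoop.mapreduce API (MyDriver and the job name are placeholders): it is simply the Configuration you submit the Job with.

// The populated Configuration is handed to the Job at submission time;
// that is what makes the properties visible to every map and reduce task.
Job job = Job.getInstance(conf, "my-job");   // older API: new Job(conf, "my-job")
job.setJarByClass(MyDriver.class);           // MyDriver is a hypothetical driver class
// ... set mapper/reducer classes, input/output paths as usual ...
job.waitForCompletion(true);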

Then inside your M/R job, you can use the Context object to get back your Configuration in both the mapper (the map function) and the reducer (the reduce function), like this:

public void map(MD5Hash key, OverlapDataWritable value, Context context) {
    Configuration conf = context.getConfiguration();
    String someProperty = conf.get("something");
    // ...
}

Note that you can also access the Context (and therefore the Configuration) in the setup and cleanup methods, which is useful if you need to do some initialization.
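As a rough sketch of that idea (the property key "something" is reused from the snippet above; the field name and default value are placeholders), reading a value once per task in setup could look like this:

// Minimal sketch: read a configuration value once in setup() and cache it in a field,
// instead of calling conf.get() on every invocation of map().
private String someProperty;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    someProperty = conf.get("something", "default-value"); // second argument is a fallback
}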

It's also worth mentioning that you could probably call the addResource method on the Configuration object to add your properties directly as an InputStream or a file, but I believe this has to be an XML configuration like the regular Hadoop XML configs, so it might just be overkill.
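If you did go that route, it would look roughly like the sketch below (the file name is hypothetical, and the file has to use the usual Hadoop <configuration>/<property> XML layout):

// Hypothetical extra-config.xml, in the standard Hadoop format:
// <configuration>
//   <property><name>something</name><value>some-value</value></property>
// </configuration>
Configuration conf = new Configuration();
conf.addResource(new Path("/path/to/extra-config.xml")); // addResource also accepts an InputStream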

EDIT: In the case of non-String objects, I would advise using serialization: you can serialize your objects and then convert them to Strings (probably encoding them, for example with Base64, since I'm not sure what would happen if you have unusual characters), and then on the mapper/reducer side de-serialize the objects from the Strings you read back from the properties inside Configuration.
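A minimal sketch of that serialization idea, assuming the object implements java.io.Serializable and using java.util.Base64 (Java 8+; an equivalent encoder such as commons-codec would work the same way). MyCustomObject, myCustomObject, and the key "my.custom.object" are placeholders made up for the example:

// Driver side: serialize the object and store it as a Base64-encoded string.
// (Exception handling omitted for brevity.)
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(bytes);
out.writeObject(myCustomObject);   // whatever you computed during preprocessing
out.close();
conf.set("my.custom.object", Base64.getEncoder().encodeToString(bytes.toByteArray()));

// Mapper/reducer side (for example in setup()): decode and deserialize it again.
byte[] raw = Base64.getDecoder().decode(context.getConfiguration().get("my.custom.object"));
ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw));
MyCustomObject restored = (MyCustomObject) in.readObject();   // may throw ClassNotFoundException
in.close();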

Another approach would be to do the same serialization technique, but instead write to HDFS, and then add these files to the DistributedCache. Sounds a bit overkill, but this would probably work.
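A rough sketch of that variant (the paths are hypothetical; in Hadoop 2.x the DistributedCache class itself is deprecated in favor of the equivalent methods on Job):

// Driver side: write the serialized object to HDFS, then register the file with the job.
Path paramsFile = new Path("/tmp/myapp/params.ser");   // hypothetical HDFS path
job.addCacheFile(paramsFile.toUri());                  // newer equivalent of DistributedCache.addCacheFile

// Task side (for example in setup()): the cached file is available to the task.
URI[] cacheFiles = context.getCacheFiles();
// open and deserialize cacheFiles[0] here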
