Propagating custom configuration values in Hadoop
Question
Is there any way to set and (later) get a custom configuration object in Hadoop, during Map/Reduce?

For example, assume an application that preprocesses a large file and dynamically determines some characteristics related to it. Furthermore, assume that those characteristics are saved in a custom Java object (e.g., a Properties object, but not exclusively, since some may not be strings) and are subsequently needed by each of the map and reduce tasks.

How could the application "propagate" this configuration, so that each mapper and reducer function can access it when needed?
One approach could be to use the set(String, String) method of the JobConf class and, for instance, pass the configuration object serialized as a JSON string via the second parameter. However, this may be too much of a hack, and the appropriate JobConf instance would then have to be accessed by each Mapper and Reducer anyway (e.g., following an approach like the one suggested in an earlier question).

Answer

Unless I'm missing something, if you have a Properties object containing every property you need in your M/R job, you simply need to write the contents of the Properties object into the Hadoop Configuration object. For example, something like this:
Configuration conf = new Configuration();
Properties params = getParameters(); // do whatever you need here to create your object
for (Entry<Object, Object> entry : params.entrySet()) {
    String propName = (String) entry.getKey();
    String propValue = (String) entry.getValue();
    conf.set(propName, propValue);
}
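The copying loop above can be tried in isolation. In this sketch the Hadoop Configuration is replaced by a plain HashMap<String, String> (a stand-in chosen only so the snippet runs without Hadoop on the classpath; Configuration exposes the same set/get shape), while the Properties iteration and casts are the same as in the answer:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Properties;

public class PropertiesCopyDemo {

    // Same loop as in the answer: Properties stores Objects,
    // so each key and value is cast back to String before storing.
    public static Map<String, String> copy(Properties params) {
        Map<String, String> conf = new HashMap<>(); // stand-in for org.apache.hadoop.conf.Configuration
        for (Entry<Object, Object> entry : params.entrySet()) {
            String propName = (String) entry.getKey();
            String propValue = (String) entry.getValue();
            conf.put(propName, propValue);
        }
        return conf;
    }

    public static void main(String[] args) {
        Properties params = new Properties();
        params.setProperty("input.threshold", "0.75"); // example keys, not Hadoop-defined
        params.setProperty("input.format", "csv");

        Map<String, String> conf = copy(params);
        System.out.println(conf.get("input.format")); // prints csv
    }
}
```

With a real Configuration, conf.put(propName, propValue) simply becomes conf.set(propName, propValue), as in the answer's snippet.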
Then, inside your M/R job, you can use the Context object to get your Configuration back in both the mapper (the map function) and the reducer (the reduce function), like this:
public void map(MD5Hash key, OverlapDataWritable value, Context context) {
    Configuration conf = context.getConfiguration();
    String someProperty = conf.get("something");
    ....
}
Note that besides the Configuration object, you can also access the Context in the setup and cleanup methods, which is useful if you need to do some initialization.
It's also worth mentioning that you could probably call the addResource method of the Configuration object directly to add your properties as an InputStream or a file, but I believe this has to be an XML configuration like the regular Hadoop XML configs, so that might just be overkill.
EDIT: In the case of non-String objects, I would advise using serialization: you can serialize your objects and then convert them to Strings (probably encoding them, for example with Base64, since I'm not sure what would happen if you have unusual characters), and then on the mapper/reducer side de-serialize the objects from the Strings you get from the properties inside the Configuration.
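The serialize-then-Base64 idea from the EDIT can be sketched with plain JDK classes. The class and method names here (ConfigCodec, encode, decode) are illustrative, not part of any Hadoop API; the commented conf.set/conf.get calls show where each step would sit in a real job:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

public class ConfigCodec {

    // Serialize any Serializable object into a Base64 String,
    // which is safe to store via Configuration.set(name, value).
    public static String encode(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverse of encode(): decode the Base64 String and deserialize the object.
    public static Object decode(String encoded) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        java.util.ArrayList<Integer> thresholds =
                new java.util.ArrayList<>(java.util.List.of(1, 2, 3));

        String asString = encode(thresholds);   // driver side: conf.set("my.thresholds", asString)
        Object roundTripped = decode(asString); // mapper side: decode(conf.get("my.thresholds"))
        System.out.println(roundTripped);       // prints [1, 2, 3]
    }
}
```

Base64 keeps the stored value within the characters Configuration handles safely, at the cost of roughly a 33% size increase, so this is best reserved for small objects.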
Another approach would be to use the same serialization technique but write the result to HDFS instead, and then add those files to the DistributedCache. It sounds a bit like overkill, but it would probably work.