Best practice to pass copy of object to all mappers in hadoop
Hello, I am currently learning MapReduce and am trying to build a small job with Hadoop 1.0.4. I have a list of stop words and a list of patterns. Before my files are mapped, I want to load the stop words into an efficient data structure such as a map. I also want to build one regex pattern from my pattern list. Since these are serial tasks, I want to do them before the mapping and pass every mapper a copy of those two objects, which they can read/write. I thought about simply having a static variable with a getter in my driver class, but because Java passes object references by value, this doesn't work out. I could of course clone the object before I pass it, but that really does not seem like good practice. I read something about the distributed cache, but as far as I understood it, it's only for files and not for objects, and in that case I could just let every mapper read the stop word/pattern files itself.
Thanks for any help!
A possible solution is to copy stopwords.txt to HDFS before running the job, and then read it into an appropriate data structure in the Mapper's setup method. E.g.:
MyMapper class:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Map<String, Object> stopwords = null;

    @Override
    public void setup(Context context) {
        Configuration conf = context.getConfiguration();
        // hardcoded, or set it in the job runner class and retrieve it via this key
        String location = conf.get("job.stopwords.path");
        if (location != null) {
            BufferedReader br = null;
            try {
                FileSystem fs = FileSystem.get(conf);
                Path path = new Path(location);
                if (fs.exists(path)) {
                    stopwords = new HashMap<String, Object>();
                    FSDataInputStream fis = fs.open(path);
                    br = new BufferedReader(new InputStreamReader(fis));
                    String line;
                    // skip blank lines instead of stopping at the first one
                    while ((line = br.readLine()) != null) {
                        if (line.trim().length() > 0) {
                            stopwords.put(line.trim(), null);
                        }
                    }
                }
            }
            catch (IOException e) {
                // handle, e.g. rethrow as a RuntimeException to fail the task
            }
            finally {
                IOUtils.closeQuietly(br);
            }
        }
    }
    ...
}
Then you can use stopwords in your map method.
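For instance, if the job is something like a word count, the map method would tokenize each input line and skip any token found in the loaded set. The filtering itself is plain Java, so it can be sketched without the Hadoop boilerplate (the class and method names here are hypothetical, not part of the original answer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordFilter {

    // Splits a line into lowercase tokens and drops any token present in
    // the stopword set. In the Mapper's map method you would call this on
    // value.toString() and emit each surviving token with a count of 1.
    static List<String> filterTokens(String line, Set<String> stopwords) {
        List<String> kept = new ArrayList<String>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("the", "a", "of"));
        System.out.println(filterTokens("The cat sat on a mat", stop));
        // prints [cat, sat, on, mat]
    }
}
```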
Another option is to create the map object with the stopwords in the job runner class, serialize it to a Base64-encoded String, pass it to the mappers as the value of some key in the Configuration object, and deserialize it in the setup method.
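The Base64 round trip in that second option can be sketched with the standard library alone; the Hadoop-specific part is just conf.set(...) in the job runner and conf.get(...) in setup, shown here only in comments (class and method names are illustrative, not from the original answer):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashSet;

public class StopwordsCodec {

    // Serialize the stopword set into a Base64 String suitable for the
    // job Configuration, e.g. conf.set("job.stopwords", encode(words)).
    static String encode(HashSet<String> stopwords) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(stopwords);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Reverse the encoding in the mapper's setup method,
    // e.g. decode(conf.get("job.stopwords")).
    @SuppressWarnings("unchecked")
    static HashSet<String> decode(String encoded)
            throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (HashSet<String>) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashSet<String> words = new HashSet<String>(Arrays.asList("the", "a", "and"));
        System.out.println(decode(encode(words)).equals(words));
        // prints true
    }
}
```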
I'd choose the first option, not just because it's easier, but because it's not a good idea to pass bigger amount of data via the Configuration object.