Best practice to pass a copy of an object to all mappers in Hadoop


Question

Hello, I am currently learning MapReduce and am trying to build a small job with Hadoop 1.0.4. I have a list of stopwords and a list of patterns. Before my files are mapped, I want to load the stopwords into an efficient data structure such as a map. I also want to build one regex pattern from my pattern list. Since these are serial tasks, I want to do them before the mapping starts and pass every mapper a copy of those two objects, which they can read from and write to. I thought about simply having a static variable with a getter in my driver class, but because Java hands objects around as references rather than copies, this doesn't work out. I could of course clone the object before I pass it, but this really does not seem like good practice. I read something about the distributed cache, but as far as I understand it, it is only for files and not for objects, and then I could just as well let every mapper read the stopword/pattern files.

Thanks for any help!

Solution

A possible solution is to copy stopwords.txt to HDFS before running the job, and then read it into an appropriate data structure in the Mapper's setup method. For example:

MyMapper class:

...
private Map<String, Object> stopwords = null;

@Override
public void setup(Context context) {
    Configuration conf = context.getConfiguration();
    //hardcoded or set it in the jobrunner class and retrieve via this key
    String location = conf.get("job.stopwords.path");
    if (location != null) {
        BufferedReader br = null;
        try {
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(location);
            if (fs.exists(path)) {
                stopwords = new HashMap<String, Object>();
                FSDataInputStream fis = fs.open(path);
                br = new BufferedReader(new InputStreamReader(fis));
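                String line;
                // Read to the end of the file; skip blank lines instead of stopping at the first one
                while ((line = br.readLine()) != null) {
                    String word = line.trim();
                    if (word.length() > 0) {
                        stopwords.put(word, null);
                    }
                }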
            }
        }
        catch (IOException e) {
            //handle
        } 
        finally {
            IOUtils.closeQuietly(br);
        }
    }
}
...

Then you can use stopwords in your map method.
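As a rough illustration of what the setup and map methods might delegate to, here is a plain-Java sketch with no Hadoop dependency, so it can be tried without a cluster. The class and method names (SetupSketch, loadStopwords, removeStopwords, combinePatterns) are hypothetical and not part of Hadoop. It assumes one stopword per line, whitespace tokenization, and that the pattern list from the question is merged by simple alternation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: logic a Mapper could call from setup() and map().
class SetupSketch {

    // Reads one stopword per line, trimming whitespace and skipping blank lines.
    static Set<String> loadStopwords(Reader source) throws IOException {
        Set<String> words = new HashSet<String>();
        BufferedReader br = new BufferedReader(source);
        String line;
        while ((line = br.readLine()) != null) {
            String word = line.trim();
            if (word.length() > 0) {
                words.add(word);
            }
        }
        return words;
    }

    // Splits a line on whitespace and keeps the tokens that are not stopwords.
    static List<String> removeStopwords(String line, Set<String> stopwords) {
        List<String> kept = new ArrayList<String>();
        for (String token : line.trim().split("\\s+")) {
            if (token.length() > 0 && !stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Joins the individual patterns into one regex: (?:p1)|(?:p2)|...
    static Pattern combinePatterns(List<String> patterns) {
        StringBuilder sb = new StringBuilder();
        for (String p : patterns) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?:").append(p).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    public static void main(String[] args) throws IOException {
        // In a real Mapper the Reader would wrap the stream opened from HDFS in setup().
        Set<String> stopwords = loadStopwords(new StringReader("the\n\n and \nof\n"));
        System.out.println(removeStopwords("the history of hadoop and hdfs", stopwords));
        // prints [history, hadoop, hdfs]

        Pattern combined = combinePatterns(Arrays.asList("\\d+", "[A-Z]{2,}"));
        System.out.println(combined.matcher("version 104").find()); // prints true
    }
}
```

Each map task runs setup once, so every mapper ends up with its own private copy of the set and the compiled pattern, which is what the question asks for. A set is used here instead of a map with null values because only membership is checked.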

Another option is to create the map object with the stopwords in the jobrunner class, serialize it to a Base64 encoded String, pass it to the mappers as a value of some key in the Configuration object and deserialize it in the setup method.
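A minimal sketch of that round trip, using only the Java standard library, might look like the following. The class name ConfigPayloadSketch and the key "job.stopwords.data" are made up for illustration. It also assumes java.util.Base64, which needs Java 8 or later; on the older JVMs typical for Hadoop 1.0.4, a library such as Apache Commons Codec would have to stand in for it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashSet;

// Hypothetical helper, not part of Hadoop: turns a set of stopwords into a String and back.
class ConfigPayloadSketch {

    // Driver side: Java-serialize the set, then Base64-encode the bytes.
    static String encode(HashSet<String> words) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        try {
            out.writeObject(words);
        } finally {
            out.close();
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Mapper side: Base64-decode the String, then deserialize the set.
    @SuppressWarnings("unchecked")
    static HashSet<String> decode(String encoded) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(encoded);
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw));
        try {
            return (HashSet<String>) in.readObject();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        HashSet<String> stopwords = new HashSet<String>(Arrays.asList("the", "and", "of"));

        // In a job, the driver would call conf.set("job.stopwords.data", encoded) ...
        String encoded = encode(stopwords);

        // ... and setup() would call conf.get("job.stopwords.data") before decoding.
        HashSet<String> restored = decode(encoded);
        System.out.println(restored.equals(stopwords)); // prints true
    }
}
```

Base64 inflates the serialized bytes by roughly a third, and the whole string travels with the job configuration to every task, so this only seems reasonable for small objects.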

I'd choose the first option, not just because it's easier, but because it's not a good idea to pass large amounts of data via the Configuration object.
