Best practice to pass copy of object to all mappers in hadoop
Hello, I am currently learning MapReduce and am trying to build a small job with Hadoop 1.0.4. I have a list of stop words and a list of patterns. Before my files are mapped, I want to load the stop words into an efficient data structure such as a map. I also want to build one regex pattern from my pattern list. Since these are serial tasks, I want to do them before the mapping and pass every mapper a copy of those two objects, which they can read/write. I thought about simply having a static variable with a getter in my driver class, but because Java passes object references by value, this doesn't work out. I could of course clone the object before I pass it, but that really does not seem like good practice. I read something about the distributed cache, but as far as I understood it, it's only for files and not for objects, and in that case I could just let every mapper read the stop word/pattern files itself.
Thanks for any help!
A possible solution is to copy stopwords.txt to HDFS before running the job, and then read it into an appropriate data structure in the Mapper's setup method. E.g.:
MyMapper class:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Map<String, Object> stopwords = null;

    @Override
    public void setup(Context context) {
        Configuration conf = context.getConfiguration();
        // hardcoded, or set it in the job runner class and retrieve it via this key
        String location = conf.get("job.stopwords.path");
        if (location != null) {
            BufferedReader br = null;
            try {
                FileSystem fs = FileSystem.get(conf);
                Path path = new Path(location);
                if (fs.exists(path)) {
                    stopwords = new HashMap<String, Object>();
                    FSDataInputStream fis = fs.open(path);
                    br = new BufferedReader(new InputStreamReader(fis));
                    String line;
                    // skip blank lines instead of stopping at the first one
                    while ((line = br.readLine()) != null) {
                        if (line.trim().length() > 0) {
                            stopwords.put(line.trim(), null);
                        }
                    }
                }
            }
            catch (IOException e) {
                // handle, e.g. rethrow as a RuntimeException to fail the task
            }
            finally {
                IOUtils.closeQuietly(br);
            }
        }
    }
    ...
}
Then you can use stopwords in your map method.
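For instance, if the job is something like a word count, the map method would tokenize each input line and skip any token found in the loaded set. The filtering itself is plain Java, so it can be sketched without the Hadoop boilerplate (the class and method names here are hypothetical, not part of the original answer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordFilter {

    // Splits a line into lowercase tokens and drops any token present in
    // the stopword set. In the Mapper's map method you would call this on
    // value.toString() and emit each surviving token with a count of 1.
    static List<String> filterTokens(String line, Set<String> stopwords) {
        List<String> kept = new ArrayList<String>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("the", "a", "of"));
        System.out.println(filterTokens("The cat sat on a mat", stop));
        // prints [cat, sat, on, mat]
    }
}
```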
Another option is to create the map object with the stopwords in the job runner class, serialize it to a Base64-encoded String, pass it to the mappers as the value of some key in the Configuration object, and deserialize it in the setup method.
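The Base64 round trip in that second option can be sketched with the standard library alone; the Hadoop-specific part is just conf.set(...) in the job runner and conf.get(...) in setup, shown here only in comments (class and method names are illustrative, not from the original answer):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashSet;

public class StopwordsCodec {

    // Serialize the stopword set into a Base64 String suitable for the
    // job Configuration, e.g. conf.set("job.stopwords", encode(words)).
    static String encode(HashSet<String> stopwords) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(stopwords);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Reverse the encoding in the mapper's setup method,
    // e.g. decode(conf.get("job.stopwords")).
    @SuppressWarnings("unchecked")
    static HashSet<String> decode(String encoded)
            throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (HashSet<String>) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashSet<String> words = new HashSet<String>(Arrays.asList("the", "a", "and"));
        System.out.println(decode(encode(words)).equals(words));
        // prints true
    }
}
```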
I'd choose the first option, not just because it's easier, but because it's not a good idea to pass bigger amount of data via the Configuration object.