multiple input into a Mapper in hadoop
Question
I am trying to send two files to a hadoop reducer. I tried DistributedCache, but anything I put using addCacheFile in main doesn't seem to be returned by getLocalCacheFiles in the mapper.
Right now I am using FileSystem to read the file, but I am running locally, so I am able to just send the name of the file. I am wondering how to do this if I were running on a real hadoop system.
Is there any way to send values to the mapper other than through the file that it's reading?
I also had a lot of problems with the distributed cache and with sending parameters. The options that worked for me are below:
For distributed cache usage: for me it was a nightmare to get the URL/path to a file on HDFS inside Map or Reduce, but with a symlink it worked. In the run() method of the job:
DistributedCache.addCacheFile(new URI(file+"#rules.dat"), conf);
DistributedCache.createSymlink(conf);
Then declare it in the Map or Reduce class, before the methods:
public static FileSystem hdfs;
and then in the setup() method of Map or Reduce:
hdfs = FileSystem.get(new Configuration());
FSDataInputStream in = hdfs.open(new Path("rules.dat")); // open() returns a stream, not a FileSystem
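For illustration, here is a minimal sketch of a mapper that consumes the symlinked file, assuming the org.apache.hadoop.mapreduce API (RulesMapper and the in-memory rules list are hypothetical names; because createSymlink was called, the cached file also appears as rules.dat in the task's working directory, so plain local file I/O works too):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RulesMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> rules = new ArrayList<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // "rules.dat" is the fragment name registered with addCacheFile(...#rules.dat);
        // createSymlink(conf) makes it visible in the task's working directory.
        BufferedReader reader = new BufferedReader(new FileReader("rules.dat"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                rules.add(line); // keep each rule in memory for use in map()
            }
        } finally {
            reader.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... apply 'rules' to each input record ...
    }
}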
For parameters: send some values to Map or Reduce (it could be a filename to open from HDFS). In the driver:
public int run(String[] args) throws Exception {
Configuration conf = new Configuration();
...
conf.set("level", otherArgs[2]); //sets variable level from command line, it could be a filename
...
}
Then in the Map or Reduce class, just:
int level = Integer.parseInt(conf.get("level")); // this is an int, but you can also read strings, etc.
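Putting both halves together, here is a minimal end-to-end sketch under the same assumptions (LevelJob and LevelMapper are hypothetical names; the mapper gets the Configuration from the context in setup() and reads the value back):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LevelJob extends Configured implements Tool {

    public static class LevelMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private int level;

        @Override
        protected void setup(Context context) {
            // read back the value the driver stored in the job configuration
            level = Integer.parseInt(context.getConfiguration().get("level"));
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // emit each input line tagged with the configured level
            context.write(value, new IntWritable(level));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.set("level", args[2]); // pass a command-line value down to the tasks
        Job job = Job.getInstance(conf, "level job");
        job.setJarByClass(LevelJob.class);
        job.setMapperClass(LevelMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new LevelJob(), args));
    }
}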