hadoop map reduce -archives not unpacking archives


Problem Description

Hope you can help me. I've got a head-scratching problem with Hadoop map-reduce. I've been using the "-files" option successfully on a map-reduce job with Hadoop version 1.0.3. However, when I use the "-archives" option, it copies the files but does not uncompress them. What am I missing? The documentation says "Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes", but that's not what I'm seeing.

I have created 3 files - a text file "alice.txt", a zip file "bob.zip" (containing b1.txt and bdir/b2.txt), and a tar file "claire.tar" (containing c1.txt and cdir/c2.txt). I then invoke the hadoop job via

hadoop jar myJar myClass -files ./etc/alice.txt -archives ./etc/bob.zip,./etc/claire.tar <input_path> <output_path>

The files are indeed there and well-formed:

% ls -l etc/alice.txt etc/bob.zip etc/claire.tar
-rw-rw-r-- 1 hadoop hadoop     6 Aug 20 18:44 etc/alice.txt
-rw-rw-r-- 1 hadoop hadoop   282 Aug 20 18:44 etc/bob.zip
-rw-rw-r-- 1 hadoop hadoop 10240 Aug 20 18:44 etc/claire.tar
% tar tf etc/claire.tar
c1.txt
cdir/c2.txt

I then have my mapper test for the existence of the files in question, like so, where 'lineNumber' is the key passed into the mapper:

String key = Long.toString(lineNumber.get());
String [] files = {
    "alice.txt",
    "bob.zip",
    "claire.tar",
    "bdir",
    "cdir",
    "b1.txt",
    "b2.txt",
    "bdir/b2.txt",
    "c1.txt",
    "c2.txt",
    "cdir/c2.txt"
};
String fName = files[ (int) (lineNumber.get() % files.length)];
String val = codeFile(fName);
output.collect(new Text(key), new Text(val)); 

The support routine 'codeFile' is:

private String codeFile(String fName) {
    Vector<String> clauses = new Vector<String>();
    clauses.add(fName);
    File f = new File(fName);

    if (!f.exists()) {
        clauses.add("nonexistent");
    } else {
        if (f.canRead()) clauses.add("readable");
        if (f.canWrite()) clauses.add("writable");
        if (f.canExecute()) clauses.add("executable");
        if (f.isDirectory()) clauses.add("dir");
        if (f.isFile()) clauses.add("file");
    }
    return Joiner.on(',').join(clauses);
}

This uses the Guava 'Joiner' class. The output values from the mapper look like this:

alice.txt,readable,writable,executable,file
bob.zip,readable,writable,executable,dir
claire.tar,readable,writable,executable,dir
bdir,nonexistent
b1.txt,nonexistent
b2.txt,nonexistent
bdir/b2.txt,nonexistent
cdir,nonexistent
c1.txt,nonexistent
c2.txt,nonexistent
cdir/c2.txt,nonexistent

So you see the problem - the archive files are there, but they are not unpacked. What am I missing? I have also tried using DistributedCache.addCacheArchive() instead of using -archives, but the problem is still there.
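
(For reference, the programmatic route via DistributedCache.addCacheArchive() amounts to roughly the following in the job driver; the HDFS paths here are illustrative, and the archives must already be in HDFS before the job is submitted:)

    // Rough driver-side equivalent of the -archives option, using the
    // Hadoop 1.x org.apache.hadoop.filecache.DistributedCache API.
    // NOTE: the HDFS paths are illustrative; new URI(...) throws
    // URISyntaxException, so call this where that can be declared or caught.
    // jobConf is the JobConf being submitted.
    DistributedCache.addCacheArchive(new URI("/user/hadoop/cache/bob.zip"), jobConf);
    DistributedCache.addCacheArchive(new URI("/user/hadoop/cache/claire.tar"), jobConf);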

Solution

The distributed cache doesn't unpack the archive files into the local working directory of your task - there's a location on each task tracker for the job as a whole, and they're unpacked there.

You'll need to check the DistributedCache to find this location and look for the files there. The Javadocs for DistributedCache show an example mapper pulling this information.
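
A rough sketch of that lookup with the old mapred API (the class name is hypothetical; each returned Path points at the directory an archive was unpacked into on the task tracker):

    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;

    // Sketch only: class name is made up and error handling is minimal.
    public class ArchiveAwareMapper extends MapReduceBase {

        private Path[] localArchives;

        @Override
        public void configure(JobConf job) {
            try {
                // One Path per archive passed via -archives / addCacheArchive(),
                // each pointing at the directory it was unpacked into locally.
                localArchives = DistributedCache.getLocalCacheArchives(job);
            } catch (IOException e) {
                throw new RuntimeException("could not read local cache archives", e);
            }
        }

        // In map(), resolve files relative to those directories, e.g.
        //   File b2 = new File(localArchives[0].toString(), "bdir/b2.txt");
    }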

You can use symbolic linking when defining the -files and -archives generic options; a symlink will then be created in the local working directory of the map/reduce tasks, which makes this easier:

hadoop jar myJar myClass -files ./etc/alice.txt#file1.txt \
    -archives ./etc/bob.zip#bob,./etc/claire.tar#claire

And then you can use the fragment names in your mapper when trying to open files in the archive:

new File("bob").isDirectory() == true
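
With the symlink fragments above, files inside the unpacked archives can then be reached by relative path, roughly like this (file names taken from the question):

    // "bob" and "claire" are the symlink fragments from the -archives option;
    // the inner paths match the archive contents listed in the question.
    File b2 = new File("bob", "bdir/b2.txt");    // expect b2.isFile() == true
    File c2 = new File("claire", "cdir/c2.txt"); // expect c2.isFile() == true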
