hadoop map reduce -archives not unpacking archives

Problem Description

Hope you can help me. I've got a head-scratching problem with hadoop map-reduce. I've been using the "-files" option successfully on a map-reduce job with hadoop version 1.0.3. However, when I use the "-archives" option, it copies the files but does not uncompress them. What am I missing? The documentation says "Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes", but that's not what I'm seeing.

I have created 3 files - a text file "alice.txt", a zip file "bob.zip" (containing b1.txt and bdir/b2.txt), and a tar file "claire.tar" (containing c1.txt and cdir/c2.txt). I then invoke the hadoop job via

hadoop jar myJar myClass -files ./etc/alice.txt -archives ./etc/bob.zip,./etc/claire.tar <input_path> <output_path>

The files are indeed there and well-formed:

% ls -l etc/alice.txt etc/bob.zip etc/claire.tar
-rw-rw-r-- 1 hadoop hadoop     6 Aug 20 18:44 etc/alice.txt
-rw-rw-r-- 1 hadoop hadoop   282 Aug 20 18:44 etc/bob.zip
-rw-rw-r-- 1 hadoop hadoop 10240 Aug 20 18:44 etc/claire.tar
% tar tf etc/claire.tar
c1.txt
cdir/c2.txt

I then have my mapper test for the existence of the files in question, like so, where 'lineNumber' is the key passed into the mapper:

String key = Long.toString(lineNumber.get());
String[] files = {
    "alice.txt",
    "bob.zip",
    "claire.tar",
    "bdir",
    "cdir",
    "b1.txt",
    "b2.txt",
    "bdir/b2.txt",
    "c1.txt",
    "c2.txt",
    "cdir/c2.txt"
};
String fName = files[ (int) (lineNumber.get() % files.length)];
String val = codeFile(fName);
output.collect(new Text(key), new Text(val)); 

The support routine 'codeFile' is:

private String codeFile(String fName) {
    Vector<String> clauses = new Vector<String>();
    clauses.add(fName);
    File f = new File(fName);

    if (!f.exists()) {
        clauses.add("nonexistent");
    } else {
        if (f.canRead()) clauses.add("readable");
        if (f.canWrite()) clauses.add("writable");
        if (f.canExecute()) clauses.add("executable");
        if (f.isDirectory()) clauses.add("dir");
        if (f.isFile()) clauses.add("file");
    }
    return Joiner.on(',').join(clauses);
}

This uses the Guava 'Joiner' class. The output values from the mapper look like this:

alice.txt,readable,writable,executable,file
bob.zip,readable,writable,executable,dir
claire.tar,readable,writable,executable,dir
bdir,nonexistent
b1.txt,nonexistent
b2.txt,nonexistent
bdir/b2.txt,nonexistent
cdir,nonexistent
c1.txt,nonexistent
c2.txt,nonexistent
cdir/c2.txt,nonexistent

So you see the problem - the archive files are there, but they are not unpacked. What am I missing? I have also tried using DistributedCache.addCacheArchive() instead of using -archives, but the problem is still there.
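For reference, a minimal sketch of that programmatic route from the driver side (assuming the archives are already in HDFS; the class name and paths here are illustrative, not my actual setup):

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheArchiveDriver {
    public static void main(String[] args) {
        // Driver-side equivalent of the -archives option; the HDFS paths are illustrative.
        JobConf conf = new JobConf(CacheArchiveDriver.class);
        DistributedCache.addCacheArchive(new Path("/cache/bob.zip").toUri(), conf);
        DistributedCache.addCacheArchive(new Path("/cache/claire.tar").toUri(), conf);
        // ... set mapper/reducer, input/output paths, then JobClient.runJob(conf);
    }
}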

Answer

The distributed cache doesn't unpack the archive files into the local working directory of your task - there's a location on each task tracker for the job as a whole, and the archives are unpacked there.

You'll need to check the DistributedCache to find this location and look for the files there. The Javadocs for DistributedCache show an example mapper pulling this information.
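For example, a minimal sketch of that lookup in a mapper's configure() method - the class and field names here are my own, not the Javadoc example verbatim:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ArchiveAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Path[] localArchives;

    @Override
    public void configure(JobConf job) {
        try {
            // One entry per archive passed via -archives (or addCacheArchive());
            // each Path is the local directory that archive was unpacked into.
            localArchives = DistributedCache.getLocalCacheArchives(job);
        } catch (IOException e) {
            throw new RuntimeException("Cannot read distributed cache locations", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Emit the unpacked locations so they can be inspected in the job output.
        for (Path archiveDir : localArchives) {
            output.collect(new Text(key.toString()), new Text(archiveDir.toString()));
        }
    }
}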

You can use symbolic linking when defining the -files and -archives generic options, and a symlink will then be created in the local working directory of the map/reduce tasks, making this easier:

hadoop jar myJar myClass -files ./etc/alice.txt#file1.txt \
    -archives ./etc/bob.zip#bob,./etc/claire.tar#claire

And then you can use the fragment names in your mapper when trying to open files in the archive:

new File("bob").isDirectory() == true
