Troubles writing temp file on datanode with Hadoop


Problem description

I would like to create a file during my program's execution. However, I don't want this file to be written to HDFS but to the datanode's local filesystem, where the map operation is executed.

I tried the following approach:

// Inside a subclass of org.apache.hadoop.mapreduce.Mapper
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // do some hadoop stuff, like counting words
    String path = "newFile.txt";
    try {
        File f = new File(path);
        f.createNewFile();
    } catch (IOException e) {
        System.out.println("Message easy to look up in the logs.");
        System.err.println("Error easy to look up in the logs.");
        e.printStackTrace();
        throw e;
    }
}

With an absolute path, I get the file where it's supposed to be. With a relative path, however, this code doesn't produce any error, neither in the console from which I run the program nor in the job logs. Yet I can't manage to find the file that should have been created (right now, I'm working on a local cluster).

Any ideas where to find either the file or the error message? If there is indeed an error message, how should I proceed to write files to the local filesystem of the datanodes?

Solution

newFile.txt is a relative path, so the file will show up relative to your map task process's working directory. That lands somewhere under the directories the NodeManager uses for containers. The location is controlled by the configuration property yarn.nodemanager.local-dirs in yarn-site.xml, or by the default inherited from yarn-default.xml, which puts it under /tmp:

<property>
  <description>List of directories to store localized files in. An 
    application's localized file directory will be found in:
    ${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
    Individual containers' work directories, called container_${contid}, will
    be subdirectories of this.
  </description>
  <name>yarn.nodemanager.local-dirs</name>
  <value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>
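
To confirm at runtime exactly where that relative path resolves, a minimal sketch (illustrative, not from the original post) is to log the working directory from inside the map() method; the output goes to the map task's stdout log:

// Illustrative only: print the task's working directory, i.e. the
// container directory that relative paths resolve against.
System.out.println("Working directory: " + new java.io.File(".").getAbsolutePath());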

Here is a concrete example of one such directory in my test environment:

/tmp/hadoop-cnauroth/nm-local-dir/usercache/cnauroth/appcache/application_1363932793646_0002/container_1363932793646_0002_01_000001

These directories are scratch space for container execution, so they aren't something that you can rely on for persistence. A background thread periodically deletes these files for completed containers. It is possible to delay the cleanup by setting the configuration property yarn.nodemanager.delete.debug-delay-sec in yarn-site.xml:

<property>
  <description>
    Number of seconds after an application finishes before the nodemanager's 
    DeletionService will delete the application's localized file directory
    and log directory.

    To diagnose Yarn application problems, set this property's value large
    enough (for example, to 600 = 10 minutes) to permit examination of these
    directories. After changing the property's value, you must restart the 
    nodemanager in order for it to have an effect.

    The roots of Yarn applications' work directories is configurable with
    the yarn.nodemanager.local-dirs property (see below), and the roots
    of the Yarn applications' log directories is configurable with the 
    yarn.nodemanager.log-dirs property (see also below).
  </description>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>0</value>
</property>
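
For example, a troubleshooting-only override in yarn-site.xml could look like the following sketch, using the 600-second value the description itself suggests:

<!-- Hypothetical override for troubleshooting: keep container
     directories for 10 minutes after an application finishes. -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>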

However, please keep in mind that this configuration is intended only for troubleshooting issues so that you can see the directories more easily. It's not recommended as a permanent production configuration. If application logic depends on the delete delay, then that's likely to cause a race condition between the application logic attempting to access the directory and the NodeManager attempting to delete it. Leaving files lingering from old container executions also risks cluttering the local disk space.

The log messages would go to the stdout/stderr of the map task logs, but I suspect execution never reaches those statements. Instead, I suspect you are creating the file successfully, but either it's hard to find (the directory structure contains somewhat unpredictable components, such as the application ID and container ID managed by YARN) or the file is cleaned up before you can get to it.

If you changed the code to use an absolute path pointing to some other directory, that would help. However, I don't expect this approach to work well in practice. Since Hadoop is distributed, it may be hard to find which node in a cluster of hundreds or thousands contains the local file you want. Instead, you might be better off writing to HDFS and then pulling the files you need down to the node where you launched the job.
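
Here is a minimal sketch of that alternative (the HDFS path and payload are assumptions for illustration, not from the original post). Each task writes to HDFS, and the file is fetched afterwards from the launching node:

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Inside map(): write to HDFS instead of the local filesystem. In a
// real job, give each task a unique path (for example, one derived
// from the task attempt ID) so parallel mappers don't collide.
FileSystem fs = FileSystem.get(context.getConfiguration());
try (FSDataOutputStream out = fs.create(new Path("/tmp/newFile.txt"))) {
    out.writeUTF("data produced by the map task");
}

// Later, on the node where the job was launched:
//   hdfs dfs -get /tmp/newFile.txt .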
