Disk Spill during MapReduce


Problem description



I have a pretty basic question that I am trying to find an answer for. I was looking through the documentation to understand where the data is spilled to during the map phase, shuffle phase, and reduce phase. For example, if the node running Mapper A has 16 GB of RAM, but the mapper exceeds its allocated memory, the data is spilled.

Is the data spilled to HDFS, or to a tmp folder on the local disk? During the shuffle phase, when data is streamed from one node to another, is it stored in HDFS or in a temporary storage location?

The reason I ask these questions is to figure out if there needs to be a clean-up process after the job is done. Please help.

Solution

The Mapper's intermediate files (spill files) are stored in the local filesystem of the worker node where the Mapper is running. Similarly, the data streamed from one node to another is stored in the local filesystem of the worker node where the task is running.
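For context, the point at which a mapper spills is governed by its in-memory sort buffer. A minimal sketch of the relevant mapred-site.xml settings, assuming Hadoop 2.x property names (the values below are illustrative, not recommendations):

```xml
<!-- mapred-site.xml -->
<!-- Size (in MB) of the in-memory buffer that holds map output
     before it is spilled to the worker node's local disk -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>256</value>
</property>
<!-- Fraction of that buffer which, once filled, triggers a
     background spill to a local spill file -->
<property>
  <name>mapreduce.task.io.sort.spill.percent</name>
  <value>0.80</value>
</property>
```

If the map output exceeds what the buffer can hold, multiple spill files are written locally and merged before the shuffle; none of this intermediate data goes to HDFS.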

This local filesystem path is specified by the hadoop.tmp.dir property, which defaults to '/tmp'.
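For reference, hadoop.tmp.dir is set in core-site.xml. A minimal sketch (the path shown is an example, not a recommendation):

```xml
<!-- core-site.xml -->
<!-- Base directory on the local filesystem for Hadoop's temporary
     files; spill and shuffle data end up under this path -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
</property>
```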

After the job completes or fails, the temporary location used on the local filesystem gets cleared automatically. You don't have to perform any clean-up process; it is handled by the framework.
