What should be the size of the file in HDFS for best MapReduce job performance


Problem description

I want to copy text files from external sources to HDFS. Let's assume that I can combine and split the files based on their size; what should the size of the text files be for the best custom MapReduce job performance? Does size matter?

Solution

HDFS is designed to support very large files rather than small files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once, but they read it one or more times and require those reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When we place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a 1 GB file and you want to place it in HDFS: there will be 1 GB / 64 MB = 16 splits/blocks, and these blocks will be distributed across the DataNodes. The goal of splitting the file is parallel processing and failover of the data. These blocks/chunks will reside on different DataNodes depending on your cluster configuration.
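As a rough illustration (not part of the original answer), the minimal sketch below uses the Hadoop FileSystem API to read a file's length and block size and estimate how many blocks it occupies. The NameNode URI and the file path are hypothetical placeholders.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockCount {
    public static void main(String[] args) throws IOException {
        // Hypothetical NameNode address; in practice this usually comes from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Hypothetical file already stored in HDFS.
        Path file = new Path("/data/input/big.txt");
        FileStatus status = fs.getFileStatus(file);

        long blockSize = status.getBlockSize();              // e.g. 64 MB on older defaults
        long length = status.getLen();                       // file size in bytes
        long blocks = (length + blockSize - 1) / blockSize;  // ceiling division

        System.out.printf("%s: %d bytes, block size %d, ~%d block(s)%n",
                file, length, blockSize, blocks);
    }
}
```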

How mappers get assigned

The number of mappers is determined by the number of splits of your data in the MapReduce job. In a typical InputFormat, it is directly proportional to the number of files and their sizes. Suppose your HDFS block size is configured to 64 MB (the default) and you have a 100 MB file: there will be 2 splits, the file will occupy 2 blocks, and 2 mappers will be assigned based on those blocks. But suppose you have 2 files of 30 MB each: then each file will occupy one block, and mappers will be assigned based on that.
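To make that arithmetic concrete, here is a small self-contained sketch that mirrors (in simplified form, ignoring corner cases such as the slop factor) the default rule that split size follows the block size unless min/max split settings override it. The numbers reuse the 100 MB and 2 × 30 MB examples above.

```java
public class SplitMath {
    // Simplified version of the usual rule: max(minSize, min(maxSize, blockSize)).
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static long splitsFor(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;  // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;                      // 64 MB block size
        long split = splitSize(blockSize, 1, Long.MAX_VALUE);    // defaults -> 64 MB

        long big = 100L * 1024 * 1024;                           // one 100 MB file
        long small = 30L * 1024 * 1024;                          // two 30 MB files

        System.out.println("100 MB file   -> " + splitsFor(big, split) + " mappers");       // 2
        System.out.println("2 x 30 MB     -> " + 2 * splitsFor(small, split) + " mappers"); // 2
    }
}
```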

So you don't need to split large files, but if you are dealing with very small files then it is worth combining them.
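One common way to combine small files at the input-format level (not mentioned in the original answer) is CombineTextInputFormat, which packs several small files into each split; it is available in Hadoop 2.x and later, while on Hadoop 1.x you would subclass CombineFileInputFormat instead. The sketch below assumes the input and output paths are placeholders passed on the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine small files");
        job.setJarByClass(SmallFilesJob.class);

        // Pack many small files into splits of up to ~128 MB each,
        // so one mapper processes several files instead of one file per mapper.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```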

This link will be helpful for understanding the problem with small files.

Please refer to the link below for more detail about the HDFS design.

http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

