How the data is split in Hadoop

Question

Does Hadoop split the data based on the number of mappers set in the program? That is, for a data set of size 500 MB, if the number of mappers is 200 (assuming the Hadoop cluster allows 200 mappers to run simultaneously), is each mapper given 2.5 MB of data?

Besides, do all the mappers run simultaneously, or might some of them run serially?

Answer

I just ran a sample MR program based on your question, and here are my findings.

Input: a file smaller than the block size.

Case 1: Number of mappers = 1. Result: 1 map task launched. The input split size for the mapper (in this case there is only one) is the same as the input file size.

Case 2: Number of mappers = 5. Result: 5 map tasks launched. The input split size for each mapper is one fifth of the input file size.

Case 3: Number of mappers = 10. Result: 10 map tasks launched. The input split size for each mapper is one tenth of the input file size.
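
For reference, a driver along these lines can reproduce the experiment. This is only a sketch using the classic mapred API with identity mapper and reducer; the class name SplitExperiment and the argument layout are mine, not taken from the answer.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SplitExperiment {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitExperiment.class);
        conf.setJobName("split-experiment");

        // Pass records through untouched; the point is only to observe the splits.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // The "number of mappers" varied across the cases above; in the
        // classic API this is only a hint to the framework.
        conf.setNumMapTasks(Integer.parseInt(args[2]));

        JobClient.runJob(conf);
    }
}

Running it with 1, 5 and 10 as the last argument corresponds to the three cases.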

So, based on the above, for a file smaller than the block size:

Split size = total size of the input file / number of map tasks launched.
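
That rule lines up with how the classic mapred FileInputFormat computes the split size. The following is my simplified sketch of that calculation, not the exact Hadoop source:

// Simplified from the classic org.apache.hadoop.mapred.FileInputFormat.getSplits().
// numSplits is the requested number of map tasks (the hint set on the job).
static long splitSize(long totalSize, int numSplits, long minSize, long blockSize) {
    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    // For a file smaller than the block size, and with the default minSize of 1 byte,
    // this reduces to totalSize / numSplits -- exactly the rule observed above.
    return Math.max(minSize, Math.min(goalSize, blockSize));
}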

Note: But do keep in mind that the number of map tasks is decided based on the input splits.
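
In the newer mapreduce API, the split size (and therefore the number of map tasks) is driven by the block size together with the min/max split-size properties rather than by a requested mapper count. A minimal sketch, assuming the standard property names:

import org.apache.hadoop.conf.Configuration;

public class SplitSizeTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The new-API FileInputFormat derives the split size roughly as
        //   max(minSize, min(maxSize, blockSize))
        // so more map tasks come from lowering maxsize, not from a mapper count.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024); // 32 MB per split
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1L);
    }
}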
