hadoop - how total mappers are determined


Problem Description

I am new to Hadoop and just installed Oracle's VirtualBox and the Hortonworks sandbox. I then downloaded the latest version of Hadoop and imported the jar files into my Java program. I copied a sample wordcount program and created a new jar file. I ran this jar file as a job using the sandbox. The wordcount works perfectly fine, as expected. However, on my job status page, I see that the number of mappers for my input file is determined as 28. My input file contains the following line.

Ramesh is studying at XXXXXXXXXX XX XXXXX XX XXXXXXXXX.

How is the total number of mappers determined as 28?

I added the line below to my wordcount.java program to check:

FileInputFormat.setMaxInputSplitSize(job, 2); // caps each input split at a maximum of 2 bytes

Also, I would like to know if the input file can contain only 2 rows. That is, suppose I have an input file like the one below.

row1,row2,row3,row4,row5,row6.......row20

Should I split the input file into 20 different files each having only 2 rows?

Solution

That means your input file is split into roughly 28 parts (blocks) in HDFS, since you said 28 map tasks were scheduled. However, that does not mean all 28 map tasks will run in parallel. Parallelism depends on the number of slots available in your cluster. I'm speaking in terms of Apache Hadoop; I don't know whether Hortonworks has modified this behavior.
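
For reference, Apache Hadoop's FileInputFormat derives the split size as max(minSize, min(maxSize, blockSize)); the minimal sketch below plugs in the values from the question. The 56-byte file length is a hypothetical figure for the one-line input, but with the 2-byte cap from setMaxInputSplitSize(job, 2) it would produce exactly 28 splits:

    // Sketch of Apache Hadoop's split-size arithmetic (FileInputFormat.computeSplitSize).
    public class SplitMath {
        // splitSize = max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize  = 128L * 1024 * 1024; // typical HDFS block size (128 MB)
            long minSize    = 1;                  // default minimum split size
            long maxSize    = 2;                  // from setMaxInputSplitSize(job, 2)
            long fileLength = 56;                 // hypothetical size of the one-line input, in bytes

            long splitSize = computeSplitSize(blockSize, minSize, maxSize); // -> 2
            long numSplits = (fileLength + splitSize - 1) / splitSize;      // ceil(56 / 2) -> 28
            System.out.println(numSplits + " splits, hence " + numSplits + " map tasks");
        }
    }

(The real implementation allows a split to run slightly over the split size, so the count is approximate, but the shape of the calculation is the same.)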

Hadoop prefers to work with large files, so do you really want to split your input file into 20 different files?
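
If the goal is simply to control the number of mappers, the job driver can bound the split size instead of splitting the input by hand. A minimal sketch, assuming the standard org.apache.hadoop.mapreduce API; the 64 MB and 128 MB bounds and the /user/input path are example values, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Fragment from inside a job driver:
    Job job = Job.getInstance(new Configuration(), "wordcount");
    // Never make a split smaller than 64 MB (example lower bound).
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    // Never make a split larger than 128 MB (example upper bound).
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.addInputPath(job, new Path("/user/input"));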

