Pig: Control number of mappers


Question



I can control the number of reducers by using the PARALLEL clause in statements that result in reducers.

I want to control the number of mappers. The data source is already created, and I cannot reduce the number of parts in it. Is it possible to control the number of maps spawned by my Pig statements? Can I set lower and upper caps on the number of maps spawned? Is it a good idea to control this?

I tried using pig.maxCombinedSplitSize, mapred.min.split.size, mapred.tasktracker.map.tasks.maximum, etc., but they do not seem to help.

Can someone please help me understand how to control the number of maps and possibly share a working example?

Solution

There is a simple rule of thumb for the number of mappers: there are as many mappers as there are file splits. A file split depends on the size of the block into which HDFS splits your files (64MB, 128MB, or 256MB, depending on your configuration); note that FileInputFormat implementations take this into account by default but can define their own behaviour.
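The rule of thumb above can be sketched as simple arithmetic (a minimal Python sketch for illustration; the 128MB block size and the example file sizes are assumptions, not values from the question):

```python
import math

def num_splits(file_sizes, block_size):
    """Default number of file splits (and hence map tasks):
    each file contributes ceil(size / block_size) splits."""
    return sum(math.ceil(size / block_size) for size in file_sizes)

BLOCK = 128 * 1024 * 1024  # assumed 128MB HDFS block size

# One 300MB file spans 3 blocks -> 3 splits -> 3 mappers.
print(num_splits([300 * 1024 * 1024], BLOCK))       # 3

# 50 files of 10MB each: every file gets its own split -> 50 mappers,
# even though the total data is well under one block.
print(num_splits([10 * 1024 * 1024] * 50, BLOCK))   # 50
```

The second case is exactly the small-files problem the answer goes on to describe.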

Splits are important because they are tied to the physical location of the data in the cluster: Hadoop brings the code to the data, not the data to the code.

The problem arises when the size of a file is less than the block size (64MB, 128MB, 256MB): there will be as many splits as there are input files, which is inefficient, since each Map Task incurs startup overhead. In this case your best bet is to use pig.maxCombinedSplitSize, which tries to read multiple small files into one Mapper, effectively ignoring the splits. But if you make it too large you risk bringing data to the code and will run into network issues: if you force too few Mappers, data will have to be streamed from other data nodes. Keep the number close to the block size, or half of it, and you should be fine.
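To see why combining helps, here is a minimal Python sketch of a greedy first-fit packing in the spirit of pig.maxCombinedSplitSize (this is an illustration of the idea only, not Pig's actual split-combination logic; the 10MB file sizes and 128MB limit are assumptions):

```python
def combined_mappers(file_sizes, max_combined_split_size):
    """Greedy sketch: pack files into combined splits, starting a new
    split whenever adding the next file would exceed the limit."""
    mappers, current = 0, 0
    for size in sorted(file_sizes):
        if current > 0 and current + size > max_combined_split_size:
            mappers += 1       # close the current combined split
            current = 0
        current += size
    return mappers + (1 if current > 0 else 0)

MB = 1024 * 1024
# 50 files of 10MB with a 128MB combined split size: 12 files fit per
# split (120MB), so 50 files collapse into 5 mappers instead of 50.
print(combined_mappers([10 * MB] * 50, 128 * MB))   # 5
```

With the limit set near the block size, the mapper count drops from one per small file to roughly (total data / block size), which is the "keep it close to the block size" advice above.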

Another solution might be to merge the small files into one large splittable file, which will automatically generate an efficient number of Mappers.

