Pig: Force UDF to occur in Reducer or set number of mappers


Problem Description

I have a Pig script that runs a very time-consuming UDF. Pig appears to be setting the UDF to run as a map task rather than a reduce task, and as a result a suboptimally small number of mappers is created to run the job. I know I can set the default number of reducers Pig uses with setDefaultParallel, and use the PARALLEL x clause in Pig Latin to set the number of reducers for a given statement. But what do I do to set the number of mappers? I've seen posts about increasing the mapper count by defining my own InputSplit size, but I want to set the number of mappers explicitly to number of hosts * number of cores; file size shouldn't have anything to do with it.

If I can't control the number of mappers, is there any way to force my UDF to run as a reducer, since I can control the number of those?
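
For reference, here is what the two reducer-count settings mentioned in the question look like in a Pig Latin script. This is a minimal sketch; the paths, relation names, and MyExpensiveUDF are hypothetical, and SET default_parallel is the script-side counterpart of the setDefaultParallel Java API:

    -- Sketch only: paths, relations, and MyExpensiveUDF are made up.
    SET default_parallel 20;                   -- script-wide default reducer count

    logs    = LOAD '/data/logs' AS (host:chararray, ms:long);
    grouped = GROUP logs BY host PARALLEL 40;  -- per-statement reducer override
    out     = FOREACH grouped GENERATE group, MyExpensiveUDF(logs);
    STORE out INTO '/data/out';

Note that a FOREACH over a grouped relation typically executes in the reduce phase, which, as the answer below points out, is the kind of operation that usually forces a reduce.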

Recommended Answer

  1. No, you cannot specify the number of mappers explicitly, simply because Hadoop doesn't work that way. The number of mappers created is roughly total input size / input split size, though that can be skewed if you have lots of small files (which is discouraged because of how HDFS works). So basically, Pig doesn't let you do it because Hadoop doesn't offer that option in the first place. The only real lever is the split size itself; see the split-size sketch after this list.
  2. No. Not with Pig explicitly, anyway, and again because "it doesn't work that way": Pig compiles and optimizes things for you, and the output is a stream of MR jobs. Any hack you use to force the UDF into a reducer can easily break when the next version of Pig comes out. If you feel you really need the UDF in a reducer, you can create a custom MR job jar, implement a pass-through (identity) mapper in it, and do your work in the reducer. You can invoke that jar from Pig with the MAPREDUCE command; see the sketch after this list. That solution sounds wrong, though, and it's possible you're misunderstanding something. Look at what forces a reduce in Pig to get the general idea -- DISTINCT, LIMIT, and ORDER always will, and GROUP usually will as well. A JOIN usually gets both a mapper and a reducer. As you can see, the operations that force a reduce are the ones that leverage some intrinsic characteristic of Hadoop (such as ORDER living in the reduce phase because reducer input arrives sorted). There is no easy way to sneak a UDF in there, since no UDF type (eval, filter, load, store) fits naturally into a reducer.
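
On point 1: while the mapper count cannot be set directly, the split size that drives it can be capped from within a Pig script, which indirectly raises the number of mappers. A hedged sketch; the property names are the commonly documented Pig and Hadoop 2 ones (older Hadoop releases used mapred.max.split.size), so verify them against your versions:

    -- Sketch only: property names and values depend on your Pig/Hadoop versions.
    SET pig.maxCombinedSplitSize 67108864;                       -- 64 MB cap when Pig combines small splits
    SET mapreduce.input.fileinputformat.split.maxsize 67108864;  -- 64 MB cap per input split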
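
On point 2: a minimal sketch of the MAPREDUCE-operator route described above. The jar name, driver class, schema, and paths are all hypothetical; the jar would contain an identity (pass-through) mapper and do the expensive work in its reducer:

    -- Sketch only: jar, class, schema, and paths are made up.
    raw   = LOAD '/data/input' AS (line:chararray);
    heavy = MAPREDUCE 'identity-map-heavy-reduce.jar'
            STORE raw INTO '/tmp/mr_in'
            LOAD '/tmp/mr_out' AS (key:chararray, val:long)
            `com.example.IdentityMapJob /tmp/mr_in /tmp/mr_out`;

Pig stores raw to the input location, runs the jar's main class with the backtick arguments, then loads the output location back as the relation heavy.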
