Pig: Force UDF to occur in Reducer or set number of mappers

Problem description

I have a Pig script that runs a very time-consuming UDF. Pig appears to be setting the UDF to run as a map job instead of a reduce job. As a result, a suboptimally small number of mappers is created to run the job. I know I can set the default number of reducers Pig uses with setDefaultParallel, and that I can use the PARALLEL x clause in Pig Latin to set the number of reducers for a given line. But what can I do to set the number of mappers? I've seen posts about increasing the mapper count by defining my own InputSplit size, but I want to set the number of mappers explicitly to number of hosts * number of cores; file size shouldn't have anything to do with it.
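For reference, a minimal sketch of the two reducer controls mentioned above, in Pig Latin (the relation and path names are made up; SET default_parallel is the script-level counterpart of the PigServer.setDefaultParallel Java call):

    -- Script-level default: every reduce phase gets 20 reducers
    SET default_parallel 20;

    raw = LOAD '/data/input' AS (key:chararray, val:int);

    -- Per-operator override: this GROUP's reduce phase gets 40 reducers
    grouped = GROUP raw BY key PARALLEL 40;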

If I can't control the number of mappers, is there any way to force my UDF to run in a reducer, since I can control those?

Recommended answer

  1. No, you cannot specify the number of mappers explicitly, simply because Hadoop doesn't work that way. The number of mappers created is roughly total input size / input split size, but that can get skewed if you have tons of small files (which is discouraged because of how HDFS works). So basically, Pig doesn't let you do that because Hadoop doesn't offer that option in the first place. (The split-size lever is sketched after this list.)
  2. No. Not with Pig explicitly, anyway, and also because "it doesn't work that way": Pig compiles and optimizes things for you, and the output is a stream of MR jobs, so any hack you use to force the UDF into a reducer can easily break when the next version of Pig comes out. If you feel you really do need the UDF in a reducer, you can create a custom MR job jar, implement a pass-through (identity) mapper in it, and do your work in the reducer. You call that from Pig with the MAPREDUCE command (see the sketch after this list). However, that solution sounds wrong, and it's possible you're misunderstanding something. To get the big picture, look at what forces a reduce in Pig: DISTINCT, LIMIT and ORDER always do, and GROUP usually does as well; a JOIN will usually get both a mapper and a reducer. As you can see, the operations that force a reduce are the ones that leverage some intrinsic characteristic of Hadoop (like ORDER happening in the reduce phase, because reducer input arrives sorted). There is no easy way to sneak a UDF in there, since no type of UDF (eval, filter, load, store) maps naturally onto a reducer.
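On point 1: while the mapper count can't be set directly, the split-size lever the question mentions can at least be pulled from within the script. A hedged sketch, assuming the classic MapReduce property name and an illustrative 128 MB target split (the value is an assumption, not a recommendation):

    -- Cap the input split size at 128 MB so that roughly
    -- ceil(total input size / 134217728) map tasks are launched
    SET mapred.max.split.size 134217728;

    -- Pig also combines small files into splits up to this size,
    -- which counteracts the many-small-files skew described above
    SET pig.maxCombinedSplitSize 134217728;

    raw = LOAD '/data/input' AS (key:chararray, val:int);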
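On point 2: if the custom-jar route is really needed, Pig's native MAPREDUCE operator is how the jar gets wired into the dataflow. A sketch only; the jar, driver class, and HDFS paths below are hypothetical, and the driver is assumed to take its input and output paths as arguments:

    raw = LOAD '/data/input' AS (key:chararray, val:int);

    -- Pig STOREs 'raw' to the job's input path, runs the jar
    -- (identity mapper, expensive UDF logic in the reducer),
    -- then LOADs the job's output back as 'result'
    result = MAPREDUCE 'reduce-side-udf.jar'
             STORE raw INTO '/tmp/mr_in'
             LOAD '/tmp/mr_out' AS (key:chararray, val:int)
             `com.example.ReduceSideJob /tmp/mr_in /tmp/mr_out`;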
