Streaming or custom Jar in Hadoop


Problem Description

I'm running a streaming job in Hadoop (on Amazon's EMR) with the mapper and reducer written in Python. I want to know what speed gains I would see if I implemented the same mapper and reducer in Java (or used Pig).

In particular, I'm looking for people's experiences of migrating from streaming to custom jar deployments and/or Pig, and for documents containing benchmark comparisons of these options. I found this question, but the answers are not specific enough for me. I'm not looking for a comparison between Java and Python, but for a comparison between custom jar deployment in Hadoop and Python-based streaming.

My job reads NGram counts from the Google Books Ngram dataset and computes aggregate measures. CPU utilization on the compute nodes seems to be close to 100%. (I would also like to hear your opinions about the differences between having a CPU-bound and an IO-bound job.)



Thanks!



Amaç

Solution

Why consider deploying custom jars?




  • Ability to use more powerful custom input formats. For streaming jobs, even if you use pluggable input/output as mentioned here, you are limited to the keys and values of your mapper/reducer being text/strings. You would need to spend some amount of CPU cycles converting them to your required types. (See the mapper sketch after this list.)

  • I've also heard that Hadoop can be smart about reusing JVMs across multiple jobs, which isn't possible when streaming (I can't confirm this).
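
To illustrate the typed-I/O point above: a custom-jar mapper works with Hadoop's Writable types directly instead of parsing everything from tab-separated strings the way a streaming script must. This is only a minimal sketch against the Hadoop 2.x mapreduce API; the class name NgramCountMapper and the assumed record layout of the Ngram files are illustrations, not anything from the original question.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: keys and values stay strongly typed (Text / LongWritable),
    // so no extra CPU is spent converting everything to and from strings.
    public class NgramCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text ngram = new Text();
        private final LongWritable count = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: "<ngram>\t<year>\t<match_count>\t..."
            String[] fields = line.toString().split("\t");
            if (fields.length < 3) {
                return; // skip malformed records
            }
            ngram.set(fields[0]);
            count.set(Long.parseLong(fields[2]));
            context.write(ngram, count);
        }
    }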


When to use Pig?

  • Pig Latin is pretty cool and is a much higher-level data-flow language than Java, Python, or Perl. Your Pig scripts will tend to be much smaller than an equivalent task written in any of the other languages.



When NOT to use Pig?


  • Even though Pig is pretty good at figuring out by itself how many maps/reduces it needs, when to spawn a map or reduce, and a myriad of such things, if you are dead sure how many maps/reduces you need, you have some very specific computation to do within your map/reduce functions, and you are very particular about performance, then you should consider deploying your own jars. This link shows that Pig can lag native Hadoop M/R in performance. You could also take a look at writing your own Pig UDFs, which isolate some compute-intensive function (and possibly even use JNI to call some native C/C++ code within the UDF); a sketch of such a UDF follows below.
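
To make the UDF suggestion concrete, here is a minimal sketch of a Java Pig UDF built on Pig's EvalFunc base class. The class name NormalizeCount and the two-column input layout are assumptions made up for illustration, not anything from the original answer.

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Hypothetical UDF: isolates one compute-intensive step (here just a ratio)
    // so the rest of the pipeline can stay in Pig Latin.
    public class NormalizeCount extends EvalFunc<Double> {
        @Override
        public Double exec(Tuple input) throws IOException {
            if (input == null || input.size() < 2) {
                return null;
            }
            long matchCount = (Long) input.get(0); // assumed column 0: ngram match count
            long totalCount = (Long) input.get(1); // assumed column 1: total count for the year
            if (totalCount == 0) {
                return null;
            }
            // A genuinely expensive computation (or a JNI call into native code)
            // would replace this simple division.
            return (double) matchCount / totalCount;
        }
    }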


A note on IO- and CPU-bound jobs:


  • Technically speaking, the whole point of Hadoop and MapReduce is to parallelize compute-intensive functions, so I'd presume your map and reduce jobs are compute-intensive. The only time the Hadoop subsystem is busy doing IO is between the map and reduce phases, when data is sent across the network. IO can also become a problem if you have a large amount of data and have manually configured too few maps and reduces, resulting in spills to disk (although too many tasks will result in too much time spent starting/stopping JVMs and too many small files). A streaming job would additionally have the overhead of starting a Python/Perl VM and of copying data back and forth between the JVM and the scripting VM. (See the driver sketch below for where these knobs are set.)
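
As a rough sketch of those knobs, a custom-jar driver lets you pin the number of reduce tasks explicitly, and on classic MRv1 clusters the mapred.job.reuse.jvm.num.tasks property controls task-JVM reuse (newer YARN-based clusters ignore it). Everything below (class names, paths, the choice of 8 reducers) is an assumption for illustration against the Hadoop 2.x mapreduce API, not configuration from the original question.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver for the Ngram aggregation job described above.
    public class NgramDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // MRv1-era knob for reusing a task JVM within a job; -1 means "reuse without limit".
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

            Job job = Job.getInstance(conf, "ngram-aggregate");
            job.setJarByClass(NgramDriver.class);
            job.setMapperClass(NgramCountMapper.class);   // the mapper sketched earlier
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            // Explicitly choosing the number of reduce tasks: too few risks spills to disk,
            // too many wastes time starting/stopping JVMs and produces many small files.
            job.setNumReduceTasks(8);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }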

