What Are the Pros and Cons of Running a Job in Hadoop Using Various Languages?


Problem description

So far I've used either Pig or Java MapReduce exclusively for running jobs against a Hadoop cluster. I recently tried Python MapReduce through Hadoop streaming, and that was pretty cool as well. All of these make sense to me, but I'm a little hazy on when I would want to use one implementation vs. another. I've used Java MapReduce almost exclusively when I need speed, but when would I ever want to use something like Python streaming instead of just writing the same thing in fewer, more easily understandable lines in Pig/Hive? In short, what are the pros and cons of each?

Solution

I will address Java vs. Python first, and then MR vs. Hive / Pig, since I see these as two different issues.

Hadoop is built around Java: many of its capabilities are available via the Java API, and Hadoop can mostly be extended using Java classes.

Hadoop does have the capability to work with MR jobs created in other languages; this is called streaming. This model only allows us to define the mapper and reducer, with some restrictions not present in Java. At the same time, input/output formats and other plugins do have to be written as Java classes.

So I would frame the decision as follows:
a) Use Java, unless you have a serious codebase in another language that you need to reuse in your MR job.
b) Consider using Python when you need to create some simple ad hoc jobs.
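To make the streaming model concrete, here is a minimal word-count sketch in Python. It relies only on the streaming contract: the mapper reads lines from standard input and writes tab-separated key/value lines, and the reducer receives those lines sorted by key. The function names `map_lines` and `reduce_lines` and the `map`/`reduce` command-line argument are illustrative choices, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key.

    Streaming hands the reducer the mapper output sorted by key,
    so lines with the same key are adjacent and groupby can be used.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical entry point: run as "script.py map" or "script.py reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

With a script like this, the two roles would be handed to the Hadoop streaming jar through its `-mapper` and `-reducer` options. Anything beyond these two functions, such as a custom input format, would still have to be a Java class, as noted above.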

Regarding Pig / Hive: these are also Java-centric systems, at a higher level. Hive can be used without any programming at all, but it can be extended using Java. Pig requires Java from the beginning. I think these systems are almost always preferable to MR jobs in cases where they can be applied. Usually these are cases where the processing is SQL-like.

Performance considerations: streaming vs. native Java.
Streaming feeds input to the mapper via its input stream. This is interprocess communication, which is inherently less efficient than the in-process data passing between the record reader and the mapper in the Java case.
I can draw the following conclusions from the above:
a) For light processing (such as looking for a substring, counting, ...), this overhead can be significant, and a Java solution will be more efficient.
b) For heavy processing that can potentially be implemented more efficiently in some non-Java language, a streaming-based solution can have some edge.
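The shape of that overhead can be illustrated outside Hadoop with a rough analogy: the same light filter applied in-process and through a pipe to a child process. This is only a sketch under stated assumptions. It is plain Python rather than Hadoop, the helper names (`count_in_process`, `count_via_pipe`, `timed`) are made up for illustration, and the piped variant also pays for process startup, so the timings say nothing about real cluster numbers.

```python
import subprocess
import sys
import time

# Child program: counts the stdin lines that contain the substring "needle".
CHILD = "import sys\nprint(sum(1 for line in sys.stdin if 'needle' in line))"


def count_in_process(records):
    """Light processing done in the same process as the record source."""
    return sum(1 for record in records if "needle" in record)


def count_via_pipe(records):
    """Same processing, but every record is serialized and piped to a child."""
    data = "".join(record + "\n" for record in records)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=data,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def timed(fn, records):
    """Return (result, elapsed seconds) for one call of fn(records)."""
    start = time.perf_counter()
    value = fn(records)
    return value, time.perf_counter() - start
```

Calling `timed(count_in_process, records)` and `timed(count_via_pipe, records)` on the same list gives the same count with different elapsed times. The cheaper the per-record work, the larger the share of the total that goes to moving data between processes.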

Pig / Hive performance considerations.
Pig and Hive both implement primitives of SQL processing. In other words, they implement the elements of an execution plan in the RDBMS world. These implementations are good and well tuned. At the same time, Hive (which I know better) is an interpreter. It does not do code generation; it interprets the execution plan within pre-built MR job(s). This means that if you have some complex conditions and write code specifically for them, that code has every chance of doing much better than Hive, which reflects the performance advantage of a compiler over an interpreter.
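The interpreter vs. hand-written distinction can be sketched with a toy example. This is an analogy only and does not reproduce Hive's internals: the plan format, the `interpret` function, and the column names are all hypothetical. The generic version walks a plan data structure for every row, while the specialized version has the same condition written directly in code.

```python
import operator

# Generic operators the toy interpreter knows how to apply.
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

# A toy "execution plan": predicates of the form (column, op, constant), ANDed.
PLAN = [("country", "=", "US"), ("age", ">", 30)]


def interpret(plan, row):
    """Evaluate the plan against one row by looking up each operator at run time."""
    return all(OPS[op](row[column], constant) for column, op, constant in plan)


def hand_written(row):
    """The same condition written specifically for this query."""
    return row["country"] == "US" and row["age"] > 30
```

Both functions accept and reject the same rows. The interpreted one pays for the plan walk and operator lookup on every row, which is the kind of cost that purpose-written code avoids.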
