What Are the Pros and Cons of Running a Job in Hadoop Using Various Languages?

Question

So far I've used either Pig or Java MapReduce exclusively for running jobs against a Hadoop cluster. I recently tried Python MapReduce through Hadoop streaming, and that was pretty cool as well. All of these make sense to me, but I'm a little hazy on when I would want to use one implementation versus another. I use Java MapReduce basically only when I need speed, but when would I ever want to use something like Python streaming instead of just writing the same thing in fewer, more easily understandable lines in Pig/Hive? In short, what are the pros and cons of each?

Recommended answer

I will address Java vs. Python first, and then MR vs. Hive/Pig, since I see these as two different issues.

Hadoop is built around Java. Many of its capabilities are available through the Java API, and Hadoop is mostly extended using Java classes.

Hadoop does have the capability to work with MR jobs written in other languages; this is called streaming. This model only allows us to define the mapper and the reducer, with some restrictions that are not present in Java. At the same time, input/output formats and other plugins still have to be written as Java classes.

So I would define the decision as follows:
a) Use Java, unless you have a serious existing non-Java codebase that you need to reuse in your MR job.
b) Consider using Python when you need to create some simple ad hoc jobs.
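To make the streaming model concrete, here is a minimal word-count sketch in Python. It assumes the usual streaming contract: records arrive as text lines on stdin, key and value are separated by a tab, and the reducer sees its input sorted by key. The function names and the `run` helper are made up for illustration and are not part of any Hadoop API.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run(stage, stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical entry point: stage is 'map' or 'reduce'."""
    step = map_lines if stage == "map" else reduce_lines
    for out in step(stdin):
        stdout.write(out + "\n")
```

A script that calls `run(sys.argv[1])` could serve as both the mapper and the reducer of a streaming job, and the same pair can be checked locally by piping a file through the map stage, `sort`, and the reduce stage. Anything beyond the mapper and reducer (custom input formats, for example) would still need Java, as noted above.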

Regarding Pig/Hive: these are also Java-centric systems, at a higher level. Hive can be used without any programming at all, but it can be extended using Java. Pig requires Java from the beginning. I think these systems are almost always preferable to MR jobs in the cases where they can be applied. Usually these are cases where the processing is SQL-like.
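As a rough illustration of why SQL-like work favors Hive or Pig, the sketch below hand-rolls the three phases a single GROUP BY would otherwise express. The table name `visits`, its columns, and the function names are hypothetical, and the phases run in memory here rather than on a cluster.

```python
from collections import defaultdict

# Hypothetical query this hand-written job stands in for:
#   SELECT country, SUM(spend) FROM visits GROUP BY country


def map_phase(rows):
    """Emit (key, value) pairs: the grouping column and the summed column."""
    for row in rows:
        yield row["country"], row["spend"]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group's values into a single sum."""
    return {key: sum(values) for key, values in groups.items()}
```

Even in this toy form, the one-line query turns into three functions, and a real MR job adds job configuration and serialization on top.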

Performance considerations: streaming vs. native Java.

Streaming feeds input to the mapper via its input stream. This is interprocess communication, which is inherently less efficient than the in-process data passing between the record reader and the mapper that happens in the Java case.

I can draw the following conclusions from the above:
a) For light processing (such as searching for a substring or counting), this overhead can be significant, and a Java solution will be more efficient.
b) For heavy processing that can potentially be implemented more efficiently in some non-Java language, a streaming-based solution can have some edge.
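The difference between the two data paths can be sketched in plain Python, without Hadoop. The example below runs the same light substring filter once in-process and once through a pipe to a child process, which is roughly the shape of what streaming does. All names are illustrative, and the timing helper is only a crude way to compare the two on your own machine.

```python
import subprocess
import sys
import time

# Child process playing the role of a streaming mapper: it reads lines
# from stdin and writes back those containing the substring "error".
CHILD_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    if 'error' in line:\n"
    "        sys.stdout.write(line)\n"
)


def filter_in_process(records):
    """In-process path: records are passed as objects, no serialization."""
    return [r for r in records if "error" in r]


def filter_via_pipe(records):
    """Streaming-like path: serialize to text, pipe to another process,
    then parse the results back from its stdout."""
    payload = "".join(r + "\n" for r in records)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        input=payload,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def seconds_taken(fn, records):
    """Wall-clock time of one call, in seconds."""
    start = time.perf_counter()
    fn(records)
    return time.perf_counter() - start
```

Both functions return the same records; only the path the data takes differs. For small inputs the piped version mostly measures process startup, so a meaningful comparison needs a large record list, and the numbers say nothing precise about Hadoop itself.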

Pig/Hive performance considerations.

Pig and Hive both implement primitives of SQL processing. In other words, they implement the elements of an execution plan from the RDBMS world. These implementations are good and well tuned. At the same time, Hive (the one I know better) is an interpreter. It does not do code generation; it interprets the execution plan within pre-built MR job(s). This means that if you have some complex conditions and write code specifically for them, that code has every chance of doing much better than Hive, which is the performance advantage of a compiler over an interpreter.
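The interpreter-versus-compiler point can be illustrated with a toy predicate evaluator. This is only an analogy and not how Hive's operators are actually implemented: the plan format, column names, and condition are all invented for the example.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def interpret(plan, row):
    """Walk a tiny predicate tree for one row, dispatching on node type."""
    kind = plan[0]
    if kind == "and":
        return all(interpret(child, row) for child in plan[1:])
    if kind == "or":
        return any(interpret(child, row) for child in plan[1:])
    # Leaf node: (op, column, constant)
    op, column, constant = plan
    return OPS[op](row[column], constant)


# Hypothetical condition: age > 30 AND (country == "US" OR spend > 100)
PLAN = (
    "and",
    (">", "age", 30),
    ("or", ("==", "country", "US"), (">", "spend", 100)),
)


def handwritten(row):
    """The same condition written directly as code for this one query."""
    return row["age"] > 30 and (row["country"] == "US" or row["spend"] > 100)
```

`interpret` repeats the tree walk and the operator lookups for every row, while `handwritten` does only the comparisons the condition needs. The more complex the condition, the more of that per-row dispatch the specialized code avoids.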
