Does anyone find Cascading for Hadoop Map Reduce useful?


Problem description

I've been trying Cascading, but I cannot see any advantage over the classic map reduce approach for writing jobs.

Map Reduce jobs give me more freedom, and Cascading seems to put a lot of obstacles in the way.

It might do a good job of making simple things simple, but complex things... I find those extremely hard.

Is there something I'm missing? Is there an obvious advantage of Cascading over the classic approach?

In what scenarios should I choose Cascading over the classic approach? Is anyone using it and happy with it?

Solution

I've been using Cascading for a couple of years now. I find it to be extremely helpful. Ultimately, it's about productivity gains. I can be much more efficient in creating and maintaining M/R jobs compared to plain Java code. Here are a few reasons why:

  • A lot of the boilerplate code used to start a job is already written for you.
  • Composability. Generally code is easier to read and easier to reuse when it is written as components (operations) which are stitched together to perform some more complex processing.
  • I find unit testing to be easier. There are examples in the cascading package demonstrating how to write simple unit tests to directly test the output of flows.
  • The Tap (source and sink) paradigm makes it easy to change the input and output of a job, so you can, for example, start with output to STDOUT for development and debugging, then switch to HDFS SequenceFiles for batch jobs, and then switch to an HBase tap for pseudo-real-time updates (there's a small sketch of a flow after this list).
  • Another great advantage of writing Cascading jobs is that you're really writing more of a factory that creates jobs. This can be a huge advantage when you need to build something dynamically (i.e. the results of one job control which subsequent jobs you create and run). Or, in another case, I needed to create a job for each combination of 6 binary variables. That's 64 jobs, all very similar. This would be a hassle with plain Hadoop map reduce classes.
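To make the composability and Tap points concrete, here is a minimal sketch of a word-count-style flow, assuming Cascading 2.x-era package names; the class name, paths, and field names are illustrative and not taken from the answer above:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    String inputPath = args[0];   // e.g. an HDFS directory of text files
    String outputPath = args[1];  // illustrative paths, passed on the command line

    // Taps define where data comes from and where it goes.
    Tap source = new Hfs(new TextLine(new Fields("line")), inputPath);
    Tap sink = new Hfs(new TextLine(), outputPath, SinkMode.REPLACE);

    // Pipes are composed into an assembly; no mapper/reducer boilerplate.
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+")); // emit one tuple per token
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // The planner turns the assembly into one or more MapReduce jobs.
    FlowConnector connector = new HadoopFlowConnector(new Properties());
    Flow flow = connector.connect("word-count", source, sink, assembly);
    flow.complete();
  }
}
```

Swapping the sink from a text file on HDFS to, say, a SequenceFile scheme or an HBase tap only changes how the Tap is constructed; the pipe assembly stays the same, which is what makes the development-to-production switch described in the list cheap.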

While there are a lot of pre-built components that you can compose together, if a particular section of your processing logic seems like it would be easier to just write in straight Java, you can always create a Cascading function to wrap that. This lets you keep the benefits of Cascading while writing very custom operations as straight Java functions (implementing a Cascading interface).
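As a rough sketch of that escape hatch (the class name and the string-cleanup logic are invented for illustration), a custom operation is just a class that extends BaseOperation and implements Function:

```java
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

// Hypothetical operation: wraps ordinary Java string handling in a Cascading Function.
public class TrimAndLowercase extends BaseOperation implements Function {

  public TrimAndLowercase() {
    // one incoming argument, one declared output field
    super(1, new Fields("clean_word"));
  }

  @Override
  public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
    // plain Java logic, free of any MapReduce plumbing
    String word = functionCall.getArguments().getString(0);
    String clean = word.trim().toLowerCase();

    functionCall.getOutputCollector().add(new Tuple(clean));
  }
}
```

It then slots into an assembly like any built-in operation, for example `assembly = new Each(assembly, new Fields("word"), new TrimAndLowercase());`.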

