Hadoop MapReduce vs MPI (vs Spark vs Mahout vs Mesos) - When to use one over the other?

Question

I am new to parallel computing and just starting to try out MPI and Hadoop+MapReduce on Amazon AWS. But I am confused about when to use one over the other.

For example, one common rule of thumb I see can be summarized as...

  • Big data, non-iterative, fault tolerant => MapReduce
  • Speed, small data, iterative, non-Mapper-Reducer type => MPI

But then, I also see an implementation of MapReduce on MPI (MR-MPI) which does not provide fault tolerance but seems to be more efficient on some benchmarks than MapReduce on Hadoop, and seems to handle big data using out-of-core memory.
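Out-of-core here just means streaming data from disk in bounded-memory chunks instead of holding it all in RAM. A generic sketch of that idea (this is not MR-MPI's actual API, just an illustration of the pattern):

```python
# Illustrative out-of-core aggregation: stream a file in fixed-size
# chunks so memory use stays bounded regardless of file size.
import os
import tempfile

def out_of_core_sum(path, chunk_bytes=1 << 16):
    """Sum one integer per line without loading the whole file."""
    total, leftover = 0, b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            lines = (leftover + chunk).split(b"\n")
            leftover = lines.pop()  # last piece may be an incomplete line
            total += sum(int(x) for x in lines if x.strip())
    if leftover.strip():
        total += int(leftover)
    return total

# Usage: write a file we would rather not hold in memory, then stream it.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"\n".join(str(i).encode() for i in range(100000)))
print(out_of_core_sum(tmp.name))  # sum of 0..99999 = 4999950000
os.unlink(tmp.name)
```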

Conversely, there are also MPI implementations (MPICH2-YARN) on the new-generation Hadoop YARN with its distributed file system (HDFS).

Besides, there seem to be provisions within MPI (Scatter-Gather, Checkpoint-Restart, ULFM and other fault-tolerance work) that mimic several features of the MapReduce paradigm.
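The overlap is easiest to see in the dataflow: MPI_Scatter distributes chunks, each rank runs a local "map" step, and MPI_Reduce merges results at the root. A single-process Python sketch of that dataflow (no MPI library involved, just the pattern):

```python
# Single-process sketch of the scatter -> map -> reduce dataflow that
# MPI collectives share with the MapReduce paradigm.

def scatter(data, nprocs):
    """Split data into one chunk per (simulated) rank, like MPI_Scatter."""
    chunk = (len(data) + nprocs - 1) // nprocs
    return [data[i * chunk:(i + 1) * chunk] for i in range(nprocs)]

def local_map(chunk):
    """Each rank's 'map' step: count words in its own chunk."""
    counts = {}
    for word in chunk:
        counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_counts(partials):
    """Root's 'reduce' step: merge per-rank counts, like MPI_Reduce with a custom op."""
    total = {}
    for part in partials:
        for word, n in part.items():
            total[word] = total.get(word, 0) + n
    return total

words = "to be or not to be".split()
partials = [local_map(c) for c in scatter(words, 3)]
print(reduce_counts(partials))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

With real MPI, the three helpers would be replaced by the collectives themselves (MPI_Scatter/MPI_Reduce in C, or comm.scatter/comm.reduce with mpi4py), with one rank per process.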

And how do Mahout, Mesos and Spark fit into all this?

What criteria can be used when deciding between Hadoop MapReduce, MPI, Mesos, Spark and Mahout (or a combination of them)?

Answer

There might be good technical criteria for this decision but I haven't seen anything published on it. There seems to be a cultural divide where it's understood that MapReduce gets used for sifting through data in corporate environments while scientific workloads use MPI. That may be due to the underlying sensitivity of those workloads to network performance. Here are a few thoughts about how to find out:

Many modern MPI implementations can run over multiple networks but are heavily optimized for InfiniBand. The canonical use case for MapReduce seems to be a cluster of "white box" commodity systems connected via Ethernet. A quick search on "MapReduce InfiniBand" leads to http://dl.acm.org/citation.cfm?id=2511027 which suggests that use of InfiniBand in a MapReduce environment is a relatively new thing.

So why would you want to run on a system that's highly optimized for InfiniBand? It's significantly more expensive than Ethernet but has higher bandwidth, lower latency and scales better in cases of high network contention (ref: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf).

If your application is sensitive to the InfiniBand optimizations that are already baked into many MPI libraries, then MPI may serve you well. If your app is relatively insensitive to network performance and spends more time on computations that don't require communication between processes, then MapReduce is probably the better choice.

If you have the opportunity to run benchmarks, you could do a projection on whichever system you have available to see how much improved network performance would help. Try throttling your network: downclock GigE to 100 Mbit or InfiniBand QDR to DDR, for example, then draw a line through the results and see whether buying a faster interconnect that MPI is optimized for would get you where you want to go.
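That projection can be sketched as a two-point fit: model total runtime as fixed compute time plus network traffic divided by bandwidth, solve for both terms from two throttled runs, then extrapolate to the faster interconnect. All the numbers below are hypothetical:

```python
# Illustrative runtime projection from two throttled benchmark runs.
# Model: runtime = compute_time + traffic / bandwidth
# (bandwidth in Gbit/s, traffic in Gbit, times in seconds; made-up data).

def fit_runtime_model(bw1, t1, bw2, t2):
    """Solve t = compute + traffic/bw from two (bandwidth, runtime) samples."""
    # t1 - t2 = traffic * (1/bw1 - 1/bw2)
    traffic = (t1 - t2) / (1.0 / bw1 - 1.0 / bw2)
    compute = t1 - traffic / bw1
    return compute, traffic

def predict(compute, traffic, bw):
    """Projected runtime at a bandwidth we have not measured."""
    return compute + traffic / bw

# Hypothetical measurements: 600 s at 0.1 Gbit/s, 150 s at 1 Gbit/s.
compute, traffic = fit_runtime_model(0.1, 600.0, 1.0, 150.0)
# Project to a 40 Gbit/s InfiniBand-class link:
print(round(predict(compute, traffic, 40.0), 1))  # ~101.2 s: mostly compute-bound
```

In this made-up case the fitted compute floor (about 100 s) dominates at high bandwidth, so a faster interconnect would buy little; a traffic-dominated fit would argue the opposite.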
