What is the difference between Apache Spark and Apache Flink?


Problem Description


What are the differences between Apache Spark and Apache Flink?

Will Apache Flink replace Hadoop?

Solution

First, what do they have in common? Flink and Spark are both general-purpose data processing platforms and top-level projects of the Apache Software Foundation (ASF). They have a wide field of application and are usable for dozens of big data scenarios, thanks to extensions such as SQL queries (Spark: Spark SQL, Flink: MRQL), graph processing (Spark: GraphX, Flink: Spargel (base) and Gelly (library)), machine learning (Spark: MLlib, Flink: Flink ML) and stream processing (Spark Streaming, Flink Streaming). Both are capable of running in standalone mode, yet many use them on top of Hadoop (YARN, HDFS). They both deliver strong performance due to their in-memory nature.

However, the way they achieve this variety and the use cases they are specialized in differ.

Differences: First, I'd like to provide two links that go into some detail on the differences between Flink and Spark before summing it up. If you have the time, have a look at Apache Flink is the 4G of BigData Analytics Framework and Flink and Spark Similarities and Differences.

In contrast to Flink, Spark (before version 1.5.x) was not capable of handling data sets larger than the available RAM.

Flink is optimized for cyclic or iterative processes by using iterative transformations on collections. This is achieved by optimizing join algorithms, chaining operators and reusing partitioning and sorting. However, Flink is also a strong tool for batch processing. Flink streaming processes data streams as true streams, i.e., data elements are immediately "pipelined" through a streaming program as soon as they arrive. This makes it possible to perform flexible window operations on streams. Furthermore, Flink provides a very strong compatibility mode which makes it possible to run your existing Storm, MapReduce, ... code on the Flink execution engine.
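To make the streaming model concrete, here is a minimal sketch of a keyed, windowed word count using the Flink DataStream Scala API. The socket source on localhost:9999 and the 5-second window are assumptions made for this example, not part of the original answer:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object FlinkWindowedWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Assumed source: lines of text arriving on a local socket.
    val text: DataStream[String] = env.socketTextStream("localhost", 9999)

    // Each element is pipelined as soon as it arrives; the 5-second
    // window only determines how results are grouped per key.
    val counts = text
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .keyBy(_._1)
      .timeWindow(Time.seconds(5))
      .sum(1)

    counts.print()
    env.execute("Flink windowed word count (sketch)")
  }
}
```

Because elements are pipelined individually, the window here is purely a grouping concern; the engine does not wait to assemble a batch before processing each element.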

Spark, on the other hand, is based on resilient distributed datasets (RDDs). This (mostly) in-memory data structure is what gives Spark's functional programming paradigm its power. Spark is capable of big batch calculations by pinning data in memory. Spark Streaming wraps data streams into mini-batches, i.e., it collects all data that arrives within a certain period of time and runs a regular batch program on the collected data. While that batch program is running, the data for the next mini-batch is collected.
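For comparison, here is a minimal sketch of the same word count with the classic Spark Streaming (DStream) API. The socket source, the 5-second batch interval and the local[2] master are likewise assumptions for the example:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkMicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Spark micro-batch word count (sketch)")
      .setMaster("local[2]") // assumed: local run with two threads

    // Every 5 seconds the data collected so far becomes one mini-batch,
    // and a regular batch job runs over it while the next batch fills up.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)

    val counts = lines
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The structural difference shows up in the batch interval: every operation here runs once per mini-batch, whereas the Flink sketch above handles each element as it arrives.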

Will Flink replace Hadoop?

No, it will not. Hadoop consists of different parts:

  • HDFS - Hadoop Distributed Filesystem
  • YARN - Yet Another Resource Negotiator (or Resource Manager)
  • MapReduce - The batch processing framework of Hadoop

HDFS and YARN are still necessary as integral parts of big data clusters. These two form the base for other distributed technologies such as distributed query engines or distributed databases. The main use case for MapReduce is batch processing of data sets larger than the cluster's RAM, while Flink is designed for iterative processing. So, in general, the two can co-exist.
