What does "streaming" mean in Apache Spark and Apache Flink?

Question

When I visited the Apache Spark Streaming website, I saw this sentence:

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.

And on the Apache Flink website, there is this sentence:

Apache Flink is an open source platform for scalable batch and stream data processing.

What do "streaming application", "batch data processing", and "stream data processing" mean? Can you give some concrete examples? Are they designed for sensor data?

Recommended answer

Streaming data analysis (in contrast to "batch" data analysis) refers to the continuous analysis of a typically infinite stream of data items (often called events).

Stream data processing applications are typically characterized by the following points:

  • Streaming applications run continuously, for a very long time, and consume and process events as soon as they appear. In contrast, batch applications gather data in files or databases and process it later.

  • Streaming applications frequently concern themselves with the latency of results. The latency is the delay between the creation of an event and the point at which the analysis application has taken that event into account.

  • Because streams are infinite, many computations cannot refer to the entire stream, only to a "window" over the stream. A window is a view of a sub-sequence of the stream events (such as the events of the last 5 minutes). An example of a real-world window statistic is the "average stock price over the past 3 days".

  • In streaming applications, the time of an event often plays a special role. Interpreting events with respect to their order in time is very common. While certain batch applications may do that as well, it is not a core concept there.
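To make the window idea from the list above concrete, here is a minimal sketch in plain Python. It is not Spark or Flink code; it only illustrates the concept. The `(event_time, value)` event shape, the function name `windowed_average`, and the sample prices are all assumptions made for illustration, and the sketch assumes events arrive in timestamp order, which real streams do not guarantee.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# Hypothetical event shape: (event time in seconds, observed value).
Event = Tuple[float, float]


def windowed_average(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """For each incoming event, yield (event_time, average over the trailing window).

    Assumes events arrive in order of their event time.
    """
    window: Deque[Event] = deque()
    total = 0.0
    for event_time, value in events:
        # Process the event as soon as it appears.
        window.append((event_time, value))
        total += value
        # Evict events that have fallen out of the trailing window.
        while window and window[0][0] <= event_time - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield event_time, total / len(window)


if __name__ == "__main__":
    # Made-up price events; a real stream would be unbounded.
    prices = [(0, 10.0), (60, 12.0), (120, 14.0), (400, 20.0)]
    for event_time, average in windowed_average(prices, window_seconds=300):
        print(event_time, average)
```

Because `windowed_average` is a generator, it emits a result per event and never needs the whole stream in memory, only the events currently inside the window. A batch version would instead load all prices first and compute the statistic afterwards.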

Typical examples of stream data processing applications are:

  • Fraud Detection: The application tries to figure out whether a transaction fits the behavior that has been observed before. If it does not, the transaction may indicate attempted misuse. This is typically a very latency-critical application.

  • Anomaly Detection: The streaming application builds a statistical model of the events it observes. Outliers indicate anomalies and may trigger alerts. Sensor data may be one source of events that one wants to analyze for anomalies.

  • Online Recommenders: If not much past behavior information is available for a user who visits a web shop, it is interesting to learn from her behavior as she navigates the pages and explores articles, and to start generating some initial recommendations directly.

  • Up-to-date Data Warehousing: There are interesting articles on how to model a data warehousing infrastructure as a streaming application, where the event stream is the sequence of changes to the database, and the streaming application computes various warehouses as specialized "aggregate views" of the event stream.

And there are many more...
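As an illustration of the anomaly detection example, the following sketch maintains a running mean and variance (Welford's online algorithm) and flags readings that deviate strongly from what has been seen so far. It is plain Python, not a Spark or Flink API. The function name `detect_outliers`, the `threshold` and `warmup` parameters, and the sample sensor readings are assumptions made for this example, and keeping flagged readings out of the model is one possible design choice among several.

```python
import math
from typing import Iterable, Iterator, Tuple


def detect_outliers(
    values: Iterable[float], threshold: float = 3.0, warmup: int = 5
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings far from the running mean.

    The model is a running mean and variance, updated one event at a time.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for index, value in enumerate(values):
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(value - mean) > threshold * std:
                yield index, value
                continue  # assumption: outliers do not update the model
        # Welford update: fold the new value into the running statistics.
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)


if __name__ == "__main__":
    # Made-up temperature readings with one obvious spike.
    readings = [20.0, 21.0, 19.0, 20.5, 19.5, 20.0, 55.0, 20.2]
    for index, value in detect_outliers(readings):
        print(f"possible anomaly at position {index}: {value}")
```

The state kept between events is just three numbers, so the model can follow an unbounded stream of sensor data without storing it.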
