What does "streaming" mean in Apache Spark and Apache Flink?

Question

On the Apache Spark Streaming website, I saw this sentence:

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.

And on the Apache Flink website, there is this sentence:

Apache Flink is an open source platform for scalable batch and stream data processing.

What do "streaming application", "batch data processing", and "stream data processing" mean? Can you give some concrete examples? Are they designed for sensor data?

Recommended answer

Streaming data analysis (in contrast to "batch" data analysis) refers to the continuous analysis of a typically infinite stream of data items (often called events).

Stream data processing applications are typically characterized by the following points:

  • Streaming applications run continuously, for a very long time, and consume and process events as soon as they appear. In contrast, batch applications gather data in files or databases and process it later.

  • Streaming applications are frequently concerned with the latency of results. The latency is the delay between the creation of an event and the point at which the analysis application has taken that event into account.

  • Because streams are infinite, many computations cannot refer to the entire stream, but only to a "window" over the stream. A window is a view of a sub-sequence of the stream's events (such as the last 5 minutes). A real-world example of a window statistic is the "average stock price over the past 3 days".

  • In streaming applications, the time of an event often plays a special role. Interpreting events with respect to their order in time is very common. While certain batch applications may do that as well, it is not a core concept there.
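The window idea can be illustrated without either framework. The following is a minimal sketch in plain Python, not Spark or Flink API: the function name `sliding_window_average` and the `(event_time, value)` event shape are made up for illustration, and the code assumes events arrive already sorted by event time. For each incoming event it emits the average over the trailing window, which is how a statistic like "average price over the last 5 minutes" could be kept up to date as events appear.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# Hypothetical event shape: (event_time_in_seconds, value).
Event = Tuple[float, float]


def sliding_window_average(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """Yield (event_time, average of the values in the trailing window).

    Assumes events arrive in event-time order; real engines need extra
    machinery to deal with late or out-of-order events.
    """
    window: Deque[Event] = deque()
    total = 0.0
    for event_time, value in events:
        window.append((event_time, value))
        total += value
        # Evict events that are older than the trailing window.
        while window[0][0] <= event_time - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield event_time, total / len(window)


if __name__ == "__main__":
    prices = [(0, 10.0), (60, 12.0), (120, 14.0), (400, 20.0)]
    for event_time, average in sliding_window_average(prices, window_seconds=300):
        print(event_time, average)
```

Because the input is consumed lazily, the same function works on an unbounded generator of events: only the events inside the current window are held in memory, never the whole stream.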

Typical examples of stream data processing applications are:

  • Fraud Detection: The application tries to figure out whether a transaction fits the behavior that has been observed before. If it does not, the transaction may indicate attempted misuse. This is typically a very latency-critical application.

  • Anomaly Detection: The streaming application builds a statistical model of the events it observes. Outliers indicate anomalies and may trigger alerts. Sensor data may be one source of events that one wants to analyze for anomalies.

  • Online Recommenders: If little past behavior information is available for a user who visits a web shop, it is interesting to learn from her behavior as she navigates the pages and explores articles, and to start generating some initial recommendations right away.

  • Up-to-date Data Warehousing: There are interesting articles on how to model a data warehousing infrastructure as a streaming application, where the event stream is the sequence of changes to the database, and the streaming application computes various warehouses as specialized "aggregate views" of the event stream.

And many more...
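To make the anomaly detection example a little more concrete, here is a rough sketch in plain Python (again not Spark or Flink API). It assumes the "statistical model" is simply a running mean and standard deviation, maintained incrementally with Welford's method, and that an outlier is any reading more than `threshold` standard deviations from the mean. The function name `detect_anomalies` and the parameters `threshold` and `warmup` are hypothetical choices for illustration; real applications would likely use a richer model.

```python
import math
from typing import Iterable, Iterator


def detect_anomalies(
    values: Iterable[float], threshold: float = 3.0, warmup: int = 5
) -> Iterator[float]:
    """Yield values that lie far from the running mean seen so far.

    The model is updated one event at a time (Welford's method), so the
    stream never has to be stored or re-read.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        # Only judge a value once enough events have been observed.
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(x - mean) / std > threshold:
                yield x
        # Fold the new value into the model (outliers included, for simplicity).
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)


if __name__ == "__main__":
    sensor_readings = [10.0, 10.5, 9.5, 10.2, 9.8, 10.1, 50.0, 10.0]
    print(list(detect_anomalies(sensor_readings)))
```

Note that this sketch folds flagged values back into the model, which inflates the variance after an outlier; whether to do that is a design decision that depends on the application.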
