DStream.foreachRDD函数的含义是什么? [英] What's the meaning of DStream.foreachRDD function?

查看:234
本文介绍了DStream.foreachRDD函数的含义是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在Spark流中,每批数据间隔总是生成一个且只有一个RDD,为什么我们使用foreachRDD()来表示RDD? RDD只是一个,不需要学习. 在测试中,我从未发现RDD超过一个.

In spark streaming, every batch interval of data always generate one and only one RDD, why do we use foreachRDD() to foreach RDD? RDD is only one, needn't foreach. In my testing, I never see RDD more than one.

推荐答案

DStream或离散流"是将连续的数据流分成小块的抽象.这称为微批处理".每个微批处理都变成一个RDD,该RDD会提供给Spark进行进一步处理. 在每个批处理间隔中,每个DStream只会生成一个且只有一个RDD .

A DStream or "discretized stream" is an abstraction that breaks a continuous stream of data into small chunks. This is called "microbatching". Each microbatch becomes an RDD that is given to Spark for further processing. There's one and only one RDD produced for each DStream at each batch interval.

RDD是数据的分布式集合.可以将其视为指向实际数据在群集中的位置的一组指针.

An RDD is a distributed collection of data. Think of it as a set of pointers to where the actual data is in a cluster.

DStream.foreachRDD是Spark Streaming中的输出运算符".它允许您访问DStream的基础RDD,以执行对数据有实际作用的动作.例如,使用foreachRDD,您可以将数据写入数据库.

DStream.foreachRDD is an "output operator" in Spark Streaming. It allows you to access the underlying RDDs of the DStream to execute actions that do something practical with the data. For example, using foreachRDD you could write data to a database.

这里的小事就是理解DStream是一个有时间限制的集合.让我将它与经典集合进行对比:列出用户列表并对其应用一个foreach:

The little mind twist here is to understand that a DStream is a time-bound collection. Let me contrast this with a classical collection: Take a list of users and apply a foreach to it:

val userList: List[User] = ???
userList.foreach{user => doSomeSideEffect(user)}

这会将副作用函数doSomeSideEffect应用于userList集合的每个元素.

This will apply the side-effecting function doSomeSideEffect to each element of the userList collection.

现在,让我们说我们现在不认识所有用户,因此我们无法建立他们的列表.取而代之的是,我们有大量的用户,例如在早上高峰时段到达咖啡店的人:

Now, let's say that we don't know all the users now, so we cannot build a list of them. Instead, we have a stream of users, like people arriving into a coffee shop during morning rush:

val userDStream: DStream[User] = ???
userDstream.foreachRDD{usersRDD => 
    usersRDD.foreach{user => serveCoffee(user)}
}

请注意:

  • DStream.foreachRDD给您一个RDD[User]不是一个用户.回到我们的咖啡示例,这是在一定时间间隔内到达的用户的集合.
  • 要访问集合的单个元素,我们需要在RDD上进行进一步的操作.在这种情况下,我使用rdd.foreach为每个用户提供咖啡.
  • the DStream.foreachRDD gives you an RDD[User], not a single user. Going back to our coffee example, that is the collection of users that arrived during some interval of time.
  • to access single elements of the collection, we need to further operate on the RDD. In this case, I'm using a rdd.foreach to serve coffee to each user.

考虑执行力:我们可能有一群咖啡师在煮咖啡.那些是我们的执行者. Spark Streaming负责制作一小部分用户(或订单),Spark会将工作分配给咖啡师,这样我们就可以并行进行咖啡制作并加快咖啡的投放速度.

To think about execution: We might have a cluster of baristas making coffee. Those are our executors. Spark Streaming takes care of making a small batch of users (or orders) and Spark will distribute the work across the baristas, so that we can parallelize the coffee making and speed up the coffee serving.

这篇关于DStream.foreachRDD函数的含义是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆