What is the purpose of ForeachWriter in Spark Structured Streaming?


Problem Description

Can someone explain the need for ForeachWriter in Spark Structured Streaming?

Since we already get all of the source data in the form of a DataFrame, I don't see what ForeachWriter is for.

Recommended Answer

A DataFrame is an abstract Spark concept; it does not map directly onto a form that can be acted on, such as being written to the console or to a database.

By creating a ForeachWriter, you take the rows (or batches) of a DataFrame and define how to open() the destination system you want to write to, how to process() each record, and finally how to close() the resources you opened.
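For reference, the contract in Scala looks roughly like this (in recent Spark versions open() receives a partition id and an epoch id):

// Shape of the org.apache.spark.sql.ForeachWriter contract.
abstract class ForeachWriter[T] extends Serializable {
  def open(partitionId: Long, epochId: Long): Boolean // return true to write this partition
  def process(value: T): Unit                         // called once per record
  def close(errorOrNull: Throwable): Unit             // always called, even on failure
}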

Using a JDBC database as an example, you would establish a database session in open(), perhaps defining a PreparedStatement that maps to the data you want to add. You can then process() each record of some generic type T and perform whatever actions you need, such as binding fields to the statement. Finally, when the partition is finished, you close the database connection.
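A minimal sketch of such a writer might look like the following; the JDBC URL, credentials, table name, and column mapping are all assumptions chosen for illustration:

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical sink: a table events(id BIGINT, payload VARCHAR) reachable at jdbcUrl.
class JdbcForeachWriter(jdbcUrl: String, user: String, password: String)
    extends ForeachWriter[Row] {

  var connection: Connection = _
  var statement: PreparedStatement = _

  // Called once per partition/epoch: open the session and prepare the statement.
  override def open(partitionId: Long, epochId: Long): Boolean = {
    connection = DriverManager.getConnection(jdbcUrl, user, password)
    statement = connection.prepareStatement(
      "INSERT INTO events (id, payload) VALUES (?, ?)")
    true
  }

  // Called once per row: bind the fields and execute the insert.
  override def process(row: Row): Unit = {
    statement.setLong(1, row.getAs[Long]("id"))
    statement.setString(2, row.getAs[String]("payload"))
    statement.executeUpdate()
  }

  // Called when the partition finishes (or fails): release the resources.
  override def close(errorOrNull: Throwable): Unit = {
    if (statement != null) statement.close()
    if (connection != null) connection.close()
  }
}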

In the case of writing to the console, there is nothing really to open or close, but you would need to convert each field of the DataFrame to a string (toString) and then print it.
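A console-oriented writer under the same interface is essentially just process(); this is a minimal sketch:

import org.apache.spark.sql.{ForeachWriter, Row}

// Nothing to open or close; each row is rendered as a string and printed.
class ConsoleForeachWriter extends ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = true
  override def process(row: Row): Unit = println(row.mkString(", "))
  override def close(errorOrNull: Throwable): Unit = ()
}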

The use cases are, I feel, laid out well in the documentation. Basically, for any system that doesn't offer a writeStream.format("x") way of writing data, you need to implement this class yourself to get the data into your downstream systems.
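Wiring a custom writer into a streaming query is then just a matter of passing it to foreach(). A sketch, assuming inputDf is a streaming DataFrame and reusing the hypothetical JdbcForeachWriter from above with made-up connection details:

// Attach the custom writer where no built-in writeStream.format("x") sink exists.
val query = inputDf.writeStream
  .foreach(new JdbcForeachWriter("jdbc:postgresql://db:5432/app", "app_user", "secret"))
  .outputMode("append")
  .start()

query.awaitTermination()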

Or, if you need to write to multiple destinations, you can cache the DataFrame before writing to both locations, so that it doesn't need to be recomputed, which could otherwise result in inconsistent data between your destinations.
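For the multi-destination case, one common pattern (Spark 2.4+) is foreachBatch, which lets you persist each micro-batch once and then write it to both sinks; the path, table name, and connection options below are assumptions for illustration:

import org.apache.spark.sql.DataFrame

// Sketch: cache each micro-batch so both writes see exactly the same data,
// instead of recomputing the source once per destination.
val query = inputDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.persist()
    batchDf.write.mode("append").parquet("/data/events")   // destination 1
    batchDf.write.mode("append")                            // destination 2
      .format("jdbc")
      .option("url", "jdbc:postgresql://db:5432/app")
      .option("dbtable", "events")
      .option("user", "app_user")
      .option("password", "secret")
      .save()
    batchDf.unpersist()
    ()
  }
  .start()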
