How to change the location of _spark_metadata directory?


Problem description

I am using Spark Structured Streaming's streaming query to write parquet files to S3 using the following code:

ds.writeStream().format("parquet").outputMode(OutputMode.Append())
                .option("queryName", "myStreamingQuery")
                .option("checkpointLocation", "s3a://my-kafka-offset-bucket-name/")
                .option("path", "s3a://my-data-output-bucket-name/")
                .partitionBy("createdat")
                .start();

I get the desired output in the s3 bucket my-data-output-bucket-name, but along with the output I also get the _spark_metadata folder in it. How can I get rid of it? If I can't get rid of it, how can I change its location to a different S3 bucket?

Recommended answer

My understanding is that it is not possible up to Spark 2.3.

  1. The name of the metadata directory is always _spark_metadata

  2. The _spark_metadata directory is always at the location the path option points to (see the sketch below)
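
To make that concrete, here is a minimal sketch. The bucket name is the one from the question, and it assumes an already running SparkSession named spark that is configured for s3a access; since the sink only lets you choose the base path, the metadata log can only ever sit directly under it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hadoop configuration of the (assumed) SparkSession that runs the streaming query
Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();

// The base path is whatever the "path" option was set to; there is no separate
// option for the metadata log, so it always ends up at <path>/_spark_metadata
Path basePath = new Path("s3a://my-data-output-bucket-name/");
Path metadataDir = new Path(basePath, "_spark_metadata");

FileSystem fs = basePath.getFileSystem(hadoopConf);
System.out.println("_spark_metadata present: " + fs.exists(metadataDir));

// Listing the base path shows the partition directories next to _spark_metadata
for (FileStatus status : fs.listStatus(basePath)) {
    System.out.println(status.getPath());
}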

I think the only way to "fix" it is to report an issue in Apache Spark's JIRA and hope someone would pick it up.

The flow is that DataSource is requested to create the sink of a streaming query and takes the path option. With that, it creates a FileStreamSink. The path option simply becomes the basePath where the results, as well as the metadata, are written.

You can find the initial commit quite useful to understand the purpose of the metadata directory.

In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.
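
That also shows up on the read side. Here is a short sketch of a later batch read of the same base path (again using the bucket name from the question and an assumed SparkSession named spark):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Batch read of the directory the streaming sink wrote to. Because _spark_metadata
// is present under the base path, Spark reads the set of committed files from that
// log instead of doing a plain file listing, which is what preserves the
// exactly-once guarantee described above.
Dataset<Row> output = spark.read().parquet("s3a://my-data-output-bucket-name/");

output.printSchema();
output.show(10, false);

This also suggests that simply deleting the directory by hand is not just a cosmetic change, since a parquet reader that finds the log uses it instead of the actual file listing.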
