How to change the location of the _spark_metadata directory?
Question
I am using Spark Structured Streaming's streaming query to write Parquet files to S3 using the following code:
ds.writeStream().format("parquet").outputMode(OutputMode.Append())
.option("queryName", "myStreamingQuery")
.option("checkpointLocation", "s3a://my-kafka-offset-bucket-name/")
.option("path", "s3a://my-data-output-bucket-name/")
.partitionBy("createdat")
.start();
I get the desired output in the S3 bucket my-data-output-bucket-name, but along with the output I also get a _spark_metadata folder in it. How do I get rid of it? If I can't get rid of it, how do I change its location to a different S3 bucket?
Answer
My understanding is that it is not possible up to Spark 2.3.
The name of the metadata directory is always _spark_metadata, and the _spark_metadata directory is always at the location the path option points to.
I think the only way to "fix" it is to report an issue in Apache Spark's JIRA and hope someone picks it up.
The flow is that DataSource is requested to create the sink of a streaming query and takes the path option. With that, it creates a FileStreamSink. The path option simply becomes the basePath where both the results and the metadata are written.
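As an illustration of why the directory cannot be moved, here is a minimal sketch (not Spark's actual implementation; the constant and method names are hypothetical) of how the sink effectively resolves both the data and the metadata against the same basePath taken from the path option:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class MetadataPathSketch {
    // Spark's FileStreamSink names its log directory "_spark_metadata";
    // this constant mirrors that convention for illustration only.
    static final String METADATA_DIR = "_spark_metadata";

    // Both the data files and the metadata log are resolved against the
    // same basePath, which is why the metadata cannot live in another bucket.
    static Path metadataPath(String basePath) {
        return Paths.get(basePath, METADATA_DIR);
    }

    public static void main(String[] args) {
        System.out.println(metadataPath("/my-data-output-bucket-name"));
        // prints: /my-data-output-bucket-name/_spark_metadata
    }
}
```

Because the metadata location is derived from the same option as the output location, there is no separate setting to redirect it.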
You may find the initial commit quite useful for understanding the purpose of the metadata directory:
In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.
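The commit-log pattern described above can be sketched in plain Java (an illustration of the idea, not Spark's code; the directory name _metadata_log and the method names are hypothetical): each batch is written to its own directory, and only after the write succeeds is an entry atomically published to the log, so readers that trust the log never observe a partially written batch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Stream;

public class CommitLogSketch {
    // Writes one batch into its own directory, then records it in the log.
    // If the process dies before the log entry appears, the orphaned batch
    // directory is simply ignored by readers, preserving exactly-once output.
    static void commitBatch(Path base, long batchId, List<String> records) throws IOException {
        Path batchDir = base.resolve("batch-" + batchId);
        Files.createDirectories(batchDir);
        Files.write(batchDir.resolve("part-0"), records);

        // Stage the log entry in a temp file on the same filesystem, then
        // atomically move it into the log so it becomes visible in one step.
        Path logDir = base.resolve("_metadata_log"); // hypothetical name
        Files.createDirectories(logDir);
        Path tmp = Files.createTempFile(base, "entry", ".tmp");
        Files.write(tmp, List.of(batchDir.toString()));
        Files.move(tmp, logDir.resolve(Long.toString(batchId)),
                   StandardCopyOption.ATOMIC_MOVE);
    }

    // A reader consults the log rather than listing raw data directories,
    // which is what the quoted commit message means by "use it instead of
    // file listing when present".
    static List<Path> committedBatches(Path base) throws IOException {
        try (Stream<Path> entries = Files.list(base.resolve("_metadata_log"))) {
            return entries.sorted().toList();
        }
    }
}
```

The atomic rename is the key step: either the batch is fully visible in the log or not at all, which is how the sink keeps exactly-once semantics across partial failures.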