How to change the location of _spark_metadata directory?


Problem description

I am using Spark Structured Streaming's streaming query to write parquet files to S3 using the following code:

// Append-mode streaming query writing partitioned parquet files to S3.
ds.writeStream().format("parquet").outputMode(OutputMode.Append())
                .option("queryName", "myStreamingQuery")
                // checkpoints (offsets/state) go to a dedicated bucket
                .option("checkpointLocation", "s3a://my-kafka-offset-bucket-name/")
                // sink output path; _spark_metadata is created under it
                .option("path", "s3a://my-data-output-bucket-name/")
                .partitionBy("createdat")
                .start();

I get the desired output in the s3 bucket my-data-output-bucket-name, but along with the output I also get the _spark_metadata folder. How do I get rid of it? And if I can't get rid of it, how do I change its location to a different S3 bucket?

Recommended answer

My understanding is that it is not possible up to Spark 2.3.


  1. The name of the metadata directory is always _spark_metadata.

  2. The _spark_metadata directory is always at the location the path option points to.

I think the only way to "fix" it is to report an issue in Apache Spark's JIRA and hope someone would pick it up.

The flow is that DataSource is requested to create the sink of a streaming query and takes the path option. With that, it goes on to create a FileStreamSink. The path option simply becomes the basePath where the results, as well as the metadata, are written; the sketch below illustrates that coupling.
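
A minimal sketch of that derivation, in Java for illustration (FileSinkPathsSketch and METADATA_DIR are made-up names for this example, not Spark internals, which are written in Scala; only the fixed directory name _spark_metadata comes from the answer above):

import org.apache.hadoop.fs.Path;

// Illustration only: the metadata log location is derived from the "path"
// option by appending a fixed directory name, so it cannot be pointed at a
// different bucket through configuration.
class FileSinkPathsSketch {
    static final String METADATA_DIR = "_spark_metadata"; // fixed name

    public static void main(String[] args) {
        Path basePath = new Path("s3a://my-data-output-bucket-name/"); // the "path" option
        Path metadataPath = new Path(basePath, METADATA_DIR);          // always a child of basePath
        System.out.println("results under:  " + basePath);
        System.out.println("metadata under: " + metadataPath);
    }
}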

You can find the initial commit quite useful to understand the purpose of the metadata directory.


In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.

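
A small usage sketch of that read-side behavior (the bucket name is reused from the question; the class and app names are made up for the example):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadCommittedOutput {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("read-sink-output").getOrCreate();
        // Because _spark_metadata sits under the base path, this batch read
        // consults that log and lists only files written by committed
        // batches, ignoring output left behind by partial failures.
        Dataset<Row> committed = spark.read().parquet("s3a://my-data-output-bucket-name/");
        committed.show();
    }
}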
