Enable _metadata files in Spark 2.1.0


Problem description

It seems that saving empty Parquet files is broken in Spark 2.1.0: they cannot be read back in, because schema inference on them fails.

I found that since Spark 2.0, writing the _metadata file is disabled by default when writing Parquet files, but I cannot find the configuration setting to turn it back on.

I tried the following:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
    .master(url) \
    .appName(name) \
    .config('spark.hadoop.parquet.enable.summary-metadata', 'true') \
    .getOrCreate()

and quite a few different combinations, for example without the spark.hadoop prefix.
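One such variant, shown here only as a minimal sketch (the _jsc handle is a private PySpark attribute, used purely for illustration and not a supported API), is to set the flag on the underlying Hadoop configuration directly:

# Illustration only: set the Parquet summary-metadata flag on the
# SparkContext's Hadoop configuration instead of using a
# "spark.hadoop."-prefixed Spark config entry.
spark_session.sparkContext._jsc.hadoopConfiguration().set(
    'parquet.enable.summary-metadata', 'true'
)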

The code I am trying to run in PySpark:

spark_session = session.get_session()
sc = spark_session.sparkContext

df = spark_session.createDataFrame(sc.emptyRDD(), schema)

df.write.mode('overwrite').parquet(path, compression='none')

# This works (explicit schema).
df = spark_session.read.schema(schema).parquet(path)

# This throws an error (schema inference on the empty output fails).
df = spark_session.read.parquet(path)

Recommended answer

It is a problem with the behavior of sc.emptyRDD(). You can find more information on why exactly this behavior occurs at https://github.com/apache/spark/pull/12855.
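As a quick illustration of why the empty RDD misbehaves here: sc.emptyRDD() returns an RDD with zero partitions, so the Parquet writer presumably emits no part files at all, leaving nothing for schema inference to read. A small sketch, assuming a live SparkContext sc:

# An RDD from sc.emptyRDD() has no partitions at all, while
# repartition(1) forces a single (empty) partition and therefore a
# single, readable part file on write.
print(sc.emptyRDD().getNumPartitions())                 # 0
print(sc.emptyRDD().repartition(1).getNumPartitions())  # 1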

The current solution is to do the following: df = spark_session.createDataFrame(sc.emptyRDD(), schema).repartition(1), while keeping the config settings mentioned in the question.
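Putting the answer together with the configuration from the question, a minimal end-to-end sketch (url, name, schema and path as in the question) could look like this:

from pyspark.sql import SparkSession

# Sketch of the full workaround: keep the summary-metadata setting and
# force a single partition on the empty DataFrame so that a readable
# (empty) part file is actually written.
spark_session = SparkSession.builder \
    .master(url) \
    .appName(name) \
    .config('spark.hadoop.parquet.enable.summary-metadata', 'true') \
    .getOrCreate()
sc = spark_session.sparkContext

df = spark_session.createDataFrame(sc.emptyRDD(), schema).repartition(1)
df.write.mode('overwrite').parquet(path, compression='none')

# The read without an explicit schema should now succeed as well.
df = spark_session.read.parquet(path)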
