Avoid losing data type for the partitioned data when writing from Spark


Problem Description

I have a dataframe like the one below:

itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0

I would like to save this dataframe as a partitioned Parquet file:

df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

For this dataframe, when I read the data back, the itemCategory column will have the String data type.

However, at times I have dataframes from other tenants, as below.

itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0

In this case, after being written as a partitioned file and read back, the resulting dataframe will have Int as the data type of itemCategory.

Parquet files carry metadata describing the data types. How can I specify the data type for the partition column so it is read back as String instead of Int?

Recommended Answer

If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", Spark will infer all partition columns as Strings.

In Spark 2.0 or later, you can set it like this:

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

In 1.6, like this:

sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

The downside is that you have to do this each time you read the data, but at least it works.

