Avoid losing data type for the partitioned data when writing from Spark
Question
I have a dataframe like below.
itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0
I would like to save this dataframe as a partitioned parquet file:
df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)
For this dataframe, when I read the data back, itemCategory will have the String data type.
However, at times I have dataframes from other tenants, as below.
itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0
In this case, after being written as partitions and read back, the resulting dataframe will have Int as the data type for itemCategory.
Parquet files carry metadata that describes the data types. How can I specify the data type for the partition column so that it is read back as String instead of Int?
Answer
If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", Spark will infer all partition columns as Strings.
In Spark 2.0 or greater, you can set it like this:
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
In 1.6, like this:
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
The downside is that you have to do this each time you read the data, but at least it works.
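The effect of the flag can be modeled in plain Python (a hypothetical sketch, not Spark's code): with inference enabled, numeric-looking partition values become Ints; with it disabled, every partition value stays a String no matter what it looks like.

```python
# Hypothetical model of spark.sql.sources.partitionColumnTypeInference.enabled.
# Illustration only -- not Spark's actual implementation.

def read_partition_value(raw: str, inference_enabled: bool):
    if inference_enabled:
        # Numeric-looking strings are promoted to int.
        try:
            return int(raw)
        except ValueError:
            return raw
    # Inference disabled: the value always stays a string.
    return raw

print(read_partition_value("0", inference_enabled=True))   # int 0
print(read_partition_value("0", inference_enabled=False))  # str "0"
print(read_partition_value("C0", inference_enabled=True))  # str "C0"
```

With the flag off, tenants whose categories happen to look numeric ("0", "1") come back with the same String type as tenants using labels like "C0".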