Avoid losing the data type of partitioned data when writing from Spark
Question
I have a dataframe like the following:
itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0
I would like to save this dataframe as a partitioned Parquet file:
df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)
For this dataframe, when I read the data back, itemCategory will have the String data type.
However, at times I have dataframes from other tenants, like the following:
itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0
In this case, after the data is written out partitioned and read back, the resulting dataframe will have Int as the data type of itemCategory.
A Parquet file has metadata that describes the data types. How can I specify the data type for the partition column so it is read back as String instead of Int?
Answer
If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", Spark will infer all partition columns as Strings.
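The reason the type flips is that partition values live only in the directory names (e.g. `itemCategory=0`), not in the Parquet footer, so the reader re-infers each value's type from its string form. Below is a toy Python sketch of that idea, not Spark's actual implementation; the function name and the two-type simplification (Int vs. String only) are illustrative assumptions:

```python
def infer_partition_type(values, inference_enabled=True):
    """Mimic typing of partition values parsed from directory names.

    Partition values arrive as strings (from paths like 'itemCategory=0').
    With inference enabled, all-integer-looking values are converted to
    ints; otherwise everything stays a string. (Toy model, not Spark code.)
    """
    if inference_enabled and all(v.lstrip("-").isdigit() for v in values):
        return [int(v) for v in values], "IntegerType"
    return list(values), "StringType"


# Values from the second tenant's directories: all numeric, so they
# become ints unless inference is turned off.
raw = ["0", "1", "0"]
print(infer_partition_type(raw))         # ([0, 1, 0], 'IntegerType')
print(infer_partition_type(raw, False))  # (['0', '1', '0'], 'StringType')

# Values like the first tenant's ("C0", "C1") are never all-numeric,
# so they stay strings either way.
print(infer_partition_type(["C0", "C1", "C0"]))
```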
In Spark 2.0 or later, you can set it like this:
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
In 1.6, like this:
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
The downside is that you have to do this each time you read the data, but at least it works.