Avoid losing data type for the partitioned data when writing from Spark


Question

I have a dataframe as below:

itemName, itemCategory
Name1, C0
Name2, C1
Name3, C0

I would like to save this dataframe as a partitioned Parquet file:

df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

For this dataframe, when I read the data back, itemCategory will have the String data type.
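As an illustration, a minimal sketch of the round trip (assuming a SparkSession named spark, its implicits imported, and a sample path /tmp/items; none of these names appear in the original):

import spark.implicits._

val path = "/tmp/items"  // hypothetical output path

// Build the example dataframe and write it partitioned by itemCategory.
val df = Seq(
  ("Name1", "C0"),
  ("Name2", "C1"),
  ("Name3", "C0")
).toDF("itemName", "itemCategory")

df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

// The partition column's type is inferred from the directory names
// (itemCategory=C0, itemCategory=C1); values like "C0" are not numeric,
// so the column comes back as string.
spark.read.parquet(path).printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: string (nullable = true)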

However, at times I have dataframes from other tenants, as below:

itemName, itemCategory
Name1, 0
Name2, 1
Name3, 0

In this case, after the data is written out partitioned, the resulting dataframe will have the Int data type for itemCategory when read back.

Parquet files carry metadata that describes the data types. How can I specify the data type for the partition column so it is read back as String instead of Int?

Answer

If you set "spark.sql.sources.partitionColumnTypeInference.enabled" to "false", Spark will infer all partition columns as Strings.

In Spark 2.0 or later, you can set it like this:

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

In Spark 1.6, like this:

sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

The downside is that you have to do this each time you read the data, but at least it works.
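For example, with the numeric-looking categories from the second tenant (a sketch under the same assumptions as above: a SparkSession named spark and the hypothetical path /tmp/items):

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

// With inference disabled, even directory names like itemCategory=0
// and itemCategory=1 are read back as string instead of int.
spark.read.parquet(path).printSchema()
// root
//  |-- itemName: string (nullable = true)
//  |-- itemCategory: string (nullable = true)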

