How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Question
I use Spark 2.2.0.

I am reading a CSV file as follows:
val dataFrame = spark.read.option("inferSchema", "true")
.option("header", true)
.option("dateFormat", "yyyyMMdd")
.csv(pathToCSVFile)
There is one date column in this file, and every record has the value 20171001 in that column.
The issue is that Spark infers the type of this column as integer rather than date. When I remove the "inferSchema" option, the column's type is string.
There are no null values, nor any wrongly formatted lines in this file.
What is the reason/solution for this issue?
Answer
If my understanding is correct, the code implies the following order of type inference (with the earlier types checked first):
- NullType
- IntegerType
- LongType
- DecimalType
- DoubleType
- TimestampType
- BooleanType
- StringType
With that, I think the issue is that 20171001 matches IntegerType before TimestampType is even considered (and TimestampType uses the timestampFormat option, not dateFormat).
One solution would be to define the schema and use it with the schema operator (of DataFrameReader), or to let Spark SQL infer the schema and use the cast operator.
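For the latter approach, a minimal sketch could look like the following (the column name date is an assumption here, as is the local SparkSession setup). Since to_date expects a string, the inferred integer column is cast first:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val pathToCSVFile = "data.csv"  // placeholder path

// Let Spark infer the schema (the date column comes back as integer),
// then parse it into a proper date afterwards.
val raw = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(pathToCSVFile)

// Cast the integer to string, then parse with the yyyyMMdd pattern.
val withDate = raw.withColumn("date", to_date(col("date").cast("string"), "yyyyMMdd"))
```

The two-argument to_date(column, format) overload is available as of Spark 2.2.0, which matches the version used in the question.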
I'd choose the former if the number of fields is not high.
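A sketch of the former, an explicit schema (the field names id and date are assumptions; adjust them to the actual file). Declaring the column as DateType up front makes Spark apply the dateFormat option while parsing, instead of inferring an integer:

```scala
import org.apache.spark.sql.types.{StructType, StructField, DateType, StringType}

val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("date", DateType, nullable = true)
))

val dataFrame = spark.read
  .schema(schema)          // skip inference entirely
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .csv(pathToCSVFile)
```

With this, 20171001 is parsed directly into a date value and the column's type is date from the start.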