How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Question

I use Spark 2.2.0.

I am reading a CSV file as follows:
val dataFrame = spark.read.option("inferSchema", "true")
.option("header", true)
.option("dateFormat", "yyyyMMdd")
.csv(pathToCSVFile)
There is one date column in this file, and all records have a value equal to 20171001 for this particular column.
The issue is that Spark infers that the type of this column is integer rather than date. When I remove the "inferSchema" option, the type of that column is string.
There are no null values, nor any wrongly formatted lines in this file.
What is the reason/solution for this issue?
Answer
If my understanding is correct, the code implies the following order of type inference (with the first types being checked against first):
NullType
IntegerType
LongType
DecimalType
DoubleType
TimestampType
BooleanType
StringType
With that, I think the issue is that 20171001 matches IntegerType before even considering TimestampType (which uses the timestampFormat option, not dateFormat).
One solution would be to define the schema and use it with the schema operator (of DataFrameReader), or let Spark SQL infer the schema and use the cast operator.
I'd choose the former if the number of fields is not high.
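Both approaches can be sketched as follows; the sample file path and its contents are hypothetical stand-ins for the question's pathToCSVFile:

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date
import org.apache.spark.sql.types.{DateType, StructField, StructType}

object SchemaVsCast extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("schema-vs-cast")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample file
  val pathToCSVFile = "/tmp/dates.csv"
  Files.write(Paths.get(pathToCSVFile), "date\n20171001\n".getBytes("UTF-8"))

  // Approach 1: explicit schema -- with DateType declared up front,
  // the dateFormat option is used to parse the column
  val schema = StructType(Seq(StructField("date", DateType, nullable = true)))
  val withSchema = spark.read
    .option("header", "true")
    .option("dateFormat", "yyyyMMdd")
    .schema(schema)
    .csv(pathToCSVFile)

  // Approach 2: read the column as string (no inferSchema),
  // then cast with to_date(col, format), available since Spark 2.2
  val casted = spark.read
    .option("header", "true")
    .csv(pathToCSVFile)
    .withColumn("date", to_date($"date", "yyyyMMdd"))

  withSchema.printSchema() // date column is DateType
  casted.printSchema()     // date column is DateType

  spark.stop()
}
```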