How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?

Problem description

I use Spark 2.2.0.

I am reading a CSV file as follows:

val dataFrame = spark.read.option("inferSchema", "true")
                          .option("header", true)
                          .option("dateFormat", "yyyyMMdd")
                          .csv(pathToCSVFile)

There is one date column in this file, and every record has the value 20171001 in that column.

The issue is that Spark infers the type of this column as integer rather than date. When I remove the "inferSchema" option, the type of that column is string.

There are no null values and no wrongly formatted lines in this file.

What is the reason/solution for this issue?

Recommended answer

If my understanding is correct, the code implies the following order of type inference (types earlier in the list are tried first):

  • NullType
  • IntegerType
  • LongType
  • DecimalType
  • DoubleType
  • TimestampType
  • BooleanType
  • StringType

With that, I think the issue is that 20171001 matches IntegerType before TimestampType is even considered (and TimestampType uses the timestampFormat option, not dateFormat).
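To illustrate why a first-match order produces this result, here is a minimal sketch in plain Scala. It is a simplified model, not Spark's actual implementation: the object name InferenceSketch and the infer helper are hypothetical, and the candidate list is abbreviated (DecimalType and TimestampType are left out).

```scala
import scala.util.Try

// Simplified, hypothetical model of first-match type inference.
// This is NOT Spark's code; it only mimics the "try types in order" idea.
object InferenceSketch {
  // Candidate types in the order they are tried (abbreviated list).
  val candidates: Seq[(String, String => Boolean)] = Seq(
    "IntegerType" -> ((s: String) => Try(s.toInt).isSuccess),
    "LongType"    -> ((s: String) => Try(s.toLong).isSuccess),
    "DoubleType"  -> ((s: String) => Try(s.toDouble).isSuccess),
    "BooleanType" -> ((s: String) => Try(s.toBoolean).isSuccess)
  )

  // Return the first type whose parser accepts the value,
  // falling back to StringType when nothing matches.
  def infer(value: String): String =
    candidates
      .collectFirst { case (name, accepts) if accepts(value) => name }
      .getOrElse("StringType")
}
```

Under this model, InferenceSketch.infer("20171001") stops at IntegerType, so any date-like check placed later in the list never sees the value.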

One solution would be to define the schema explicitly and pass it to the schema operator of DataFrameReader; another is to let Spark SQL infer the schema and then convert the column with the cast operator.

I'd choose the former if the number of fields is not high.
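Both options could look roughly like the sketch below. The column names id and date, the sample rows, and the temporary file are assumptions made for illustration only; the real file's columns are not shown in the question. The second option uses to_date with a format argument (available since Spark 2.2.0) after casting the inferred integer to a string, which is one possible way to apply the conversion.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types.{DateType, IntegerType, StructField, StructType}

// In spark-shell a session named `spark` already exists;
// it is built here only so the sketch is self-contained.
val spark = SparkSession.builder()
  .appName("csv-date-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical sample file standing in for pathToCSVFile.
val tmp = Files.createTempFile("dates", ".csv")
Files.write(tmp, "id,date\n1,20171001\n2,20171001\n".getBytes("UTF-8"))
val pathToCSVFile = tmp.toString

// Option 1: declare the schema up front, so no inference takes place
// and dateFormat is applied to the DateType column.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),   // assumed column
  StructField("date", DateType, nullable = true)     // assumed column
))
val withSchema = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .schema(schema)
  .csv(pathToCSVFile)

// Option 2: let Spark infer the schema (date comes back as an integer),
// then convert the column afterwards.
val inferred = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv(pathToCSVFile)
val withCast = inferred.withColumn(
  "date",
  to_date(col("date").cast("string"), "yyyyMMdd")
)
```

Calling printSchema on either result should report the date column as a date type. The explicit schema avoids an extra pass over the data for inference, which is one reason to prefer it when the file has only a few fields.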
