How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Question

I use Spark 2.2.0.

I am reading a CSV file as follows:
val dataFrame = spark.read.option("inferSchema", "true")
.option("header", true)
.option("dateFormat", "yyyyMMdd")
.csv(pathToCSVFile)
There is one date column in this file, and all records have a value equal to 20171001 for this particular column.
The issue is that Spark infers that the type of this column is integer rather than date. When I remove the "inferSchema" option, the type of that column is string.
There are no null values, nor any wrongly formatted lines in this file.
What is the reason/solution for this issue?
Answer
If my understanding is correct, the code implies the following order of type inference (with the first types being checked against first):
NullType
IntegerType
LongType
DecimalType
DoubleType
TimestampType
BooleanType
StringType
With that, I think the issue is that 20171001 matches IntegerType before even considering TimestampType (which uses the timestampFormat option, not dateFormat).
One solution would be to define the schema and use it with the schema operator (of DataFrameReader), or let Spark SQL infer the schema and use the cast operator.
I'd choose the former if the number of fields is not high.
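Both approaches can be sketched as follows; the sample file path and its contents are hypothetical stand-ins for the question's pathToCSVFile:

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date
import org.apache.spark.sql.types.{DateType, StructField, StructType}

object SchemaVsCast extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("schema-vs-cast")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sample file
  val pathToCSVFile = "/tmp/dates.csv"
  Files.write(Paths.get(pathToCSVFile), "date\n20171001\n".getBytes("UTF-8"))

  // Approach 1: explicit schema -- with DateType declared up front,
  // the dateFormat option is used to parse the column
  val schema = StructType(Seq(StructField("date", DateType, nullable = true)))
  val withSchema = spark.read
    .option("header", "true")
    .option("dateFormat", "yyyyMMdd")
    .schema(schema)
    .csv(pathToCSVFile)

  // Approach 2: read the column as string (no inferSchema),
  // then cast with to_date(col, format), available since Spark 2.2
  val casted = spark.read
    .option("header", "true")
    .csv(pathToCSVFile)
    .withColumn("date", to_date($"date", "yyyyMMdd"))

  withSchema.printSchema() // date column is DateType
  casted.printSchema()     // date column is DateType

  spark.stop()
}
```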