inferSchema in spark-csv package


Problem description

When a CSV is read as a DataFrame in Spark, all the columns are read as strings. Is there any way to get the actual type of each column?

我有以下 csv 文件

I have the following csv file

Name,Department,years_of_experience,DOB
Sam,Software,5,1990-10-10
Alex,Data Analytics,3,1992-10-10

I've read the CSV using the code below:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(sampleAdDataS3Location)
df.schema

All the columns are read as strings. I expect the years_of_experience column to be read as int and DOB to be read as date.

Please note that I've set the inferSchema option to true.

I am using the latest version (1.0.3) of the spark-csv package.

Am I missing something here?

Answer

2015-07-30

The latest version is actually 1.1.0, but it doesn't really matter, since it looks like inferSchema is not included in the latest release.

2015-08-17

The latest version of the package is now 1.2.0 (published on 2015-08-06), and schema inference works as expected:

scala> df.printSchema
root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- DOB: string (nullable = true)

Regarding automatic date parsing, I doubt it will ever happen, or at least not without additional metadata being provided.

Even if all fields follow some date-like format, it is impossible to say whether a given field should be interpreted as a date. So the choice is either no automatic date inference or a spreadsheet-like mess, not to mention issues with timezones.

Finally, you can easily parse the date string manually:

// Register the DataFrame so it can be queried by name (Spark 1.x API)
df.registerTempTable("df")

sqlContext
  .sql("SELECT *, DATE(dob) AS dob_d FROM df")
  .drop("DOB")
  .printSchema

root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- dob_d: date (nullable = true)

So it is really not a serious issue.
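The same conversion can also be done with the DataFrame API instead of SQL, avoiding the temporary table entirely. A minimal sketch, assuming Spark 1.5 or later, where the to_date function is available in org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.to_date

// Parse the DOB string column into a proper date column,
// then drop the original string column
val withDate = df
  .withColumn("dob_d", to_date(df("DOB")))
  .drop("DOB")

withDate.printSchema

This produces the same schema as the SQL variant, with dob_d typed as date.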

2017-12-20:

The built-in csv parser, available since Spark 2.0, supports schema inference for dates and timestamps. It uses two options:

  • timestampFormat with default yyyy-MM-dd'T'HH:mm:ss.SSSXXX
  • dateFormat with default yyyy-MM-dd
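To illustrate the Spark 2.x behaviour, the sample file from the question could be read with the built-in csv source. A minimal sketch, assuming a SparkSession named spark and a hypothetical file path data.csv:

// Spark 2.x: the built-in csv source can infer temporal columns
// when the configured format matches the data
// ("data.csv" is a placeholder path for the sample file)
val df2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .csv("data.csv")

df2.printSchema
// Depending on the Spark version, DOB may be inferred as a
// date or timestamp type instead of string

Note that inference only succeeds when every value in the column matches the configured format; otherwise the column falls back to string.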

See also: How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
