inferSchema in spark-csv package


Question

When a CSV file is read as a DataFrame in Spark, all the columns are read as strings. Is there any way to get the actual type of each column?

I have the following CSV file:

Name,Department,years_of_experience,DOB
Sam,Software,5,1990-10-10
Alex,Data Analytics,3,1992-10-10

I read the CSV using the code below:

val df = sqlContext.
                  read.
                  format("com.databricks.spark.csv").
                  option("header", "true").
                  option("inferSchema", "true").
                  load(sampleAdDataS3Location)
df.schema

All the columns are read as strings. I expect the column years_of_experience to be read as int and DOB to be read as date.

Please note that I have set the option inferSchema to true.

I am using the latest version (1.0.3) of the spark-csv package.

Am I missing something here?

Recommended answer

2015-07-30

The latest version is actually 1.1.0, but that doesn't really matter, since it looks like inferSchema is not included in the latest release.

2015-08-17

The latest version of the package is now 1.2.0 (published on 2015-08-06), and schema inference works as expected:

scala> df.printSchema
root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- DOB: string (nullable = true)

Regarding automatic date parsing, I doubt it will ever happen, or at least not without providing additional metadata.

Even if all fields follow some date-like format, it is impossible to say whether a given field should be interpreted as a date. So the choice is between no automatic date inference and a spreadsheet-like mess. Not to mention issues with time zones, for example.
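To illustrate the ambiguity, here is a minimal sketch in plain Scala (no Spark involved). It assumes Java 8 or later for `java.time`; the `DateAmbiguity` object, the `parseWith` helper, and the sample value `01-02-2015` are hypothetical names and data made up for this illustration and are not part of spark-csv.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper for illustration only; not part of spark-csv.
object DateAmbiguity {
  // Parse a string using an explicitly supplied date pattern.
  def parseWith(pattern: String, value: String): LocalDate =
    LocalDate.parse(value, DateTimeFormatter.ofPattern(pattern))

  def main(args: Array[String]): Unit = {
    val raw = "01-02-2015"
    // The same text has two equally plausible readings.
    println(parseWith("dd-MM-yyyy", raw)) // day-first reading
    println(parseWith("MM-dd-yyyy", raw)) // month-first reading
  }
}
```

Both calls succeed but yield different dates, so a parser cannot pick the right one without extra information such as an explicit format.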

Finally, you can easily parse the date string manually:

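// Register the DataFrame as a temporary table so SQL can refer to it as "df"
df.registerTempTable("df")

sqlContext
  .sql("SELECT *, DATE(dob) AS dob_d FROM df")
  .drop("DOB")
  .printSchema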

root
 |-- Name: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- years_of_experience: integer (nullable = true)
 |-- dob_d: date (nullable = true)

So it is really not a serious issue.
