inferSchema using spark.read.format("com.crealytics.spark.excel") is inferring double for a date type column

Problem description

I am working on PySpark (Python 3.6 and Spark 2.1.1) and trying to fetch data from an excel file using spark.read.format("com.crealytics.spark.excel"), but it is inferring double for a date type column.

Example:

Input -

 df = spark.read.format("com.crealytics.spark.excel").\
     option("location", "D:\\Users\\ABC\\Desktop\\TmpData\\Input.xlsm").\
     option("spark.read.simpleMode", "true").\
     option("treatEmptyValuesAsNulls", "true").\
     option("addColorColumns", "false").\
     option("useHeader", "true").\
     option("inferSchema", "true").\
     load("com.databricks.spark.csv")

Result:

Name | Age | Gender | DateOfApplication
________________________________________
X    | 12  |   F    |  5/20/2015
Y    | 15  |   F    |  5/28/2015
Z    | 14  |   F    |  5/29/2015

Printing the schema -

df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Gender: string (nullable = true)
 |-- DateOfApplication: double (nullable = true)

Doing a .show -

df.show()

Name | Age  | Gender | DateOfApplication
________________________________________
X    | 12.0 |   F    |   42144.0
Y    | 15.0 |   F    |   16836.0
Z    | 14.0 |   F    |   42152.0

While reading the data set, dates (and other numeric values) are converted to double. Dates are especially problematic because the conversion completely changes the value, making it hard to revert to the original dates.

Can anyone help?

Recommended answer

Plugin author here :)

Inferring column types is done in the plugin itself. That code was taken from spark-csv. As you can see from the code, only String, Numeric, Boolean and Blank cell types are currently inferred.
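As a rough illustration of what spark-csv-style inference does (a hypothetical sketch, not the plugin's actual code), each cell gets a narrowest-type guess and the column type is the widened combination of all guesses. Because Excel stores dates as numeric cells, date values look numeric and come out as double:

```python
# Hypothetical sketch of spark-csv-style type inference. The real plugin
# inspects Apache POI cell types, but the type-widening idea is the same.

def infer_cell(value: str) -> str:
    """Guess the narrowest type for a single cell."""
    if value == "":
        return "blank"
    if value.lower() in ("true", "false"):
        return "boolean"
    try:
        float(value)          # Excel dates arrive as serial numbers,
        return "double"       # so they land here, not in a date branch
    except ValueError:
        return "string"

def widen(a: str, b: str) -> str:
    """Combine two guesses; blanks defer, conflicts fall back to string."""
    if a == "blank":
        return b
    if b == "blank" or a == b:
        return a
    return "string"

def infer_column(cells) -> str:
    """Fold the per-cell guesses into one column type."""
    col_type = "blank"
    for cell in cells:
        col_type = widen(col_type, infer_cell(cell))
    return col_type
```

For the DateOfApplication column above, `infer_column(["42144.0", "16836.0", "42152.0"])` yields "double", which matches the schema the asker observed.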

The best option would be to create a PR which properly infers date columns by using the corresponding DateUtil API.

The second-best option would be to specify the schema manually, similar to how @addmeaning described. Note that I've just released version 0.9.0, which makes some required parameters optional and changes the way the path to the file needs to be specified.

from pyspark.sql.types import StructType, StringType, DoubleType, DateType

yourSchema = StructType() \
    .add("Name", StringType(), True) \
    .add("Age", DoubleType(), True) \
    .add("Gender", StringType(), True) \
    .add("DateOfApplication", DateType(), True)

df = spark.read.format("com.crealytics.spark.excel") \
    .schema(yourSchema) \
    .option("useHeader", "true") \
    .load("D:\\Users\\ABC\\Desktop\\TmpData\\Input.xlsm")
