Read in CSV in Pyspark with correct Datatypes

Views: 774 Published: 2020/7/11 23:41:29 Tags: csv pyspark pyspark-sql


Problem description

When I try to import a local CSV with Spark, every column is read in as a string by default. However, my columns only include integers and a timestamp type. More specifically, the CSV looks like this:

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

I found code in this question that should work, but when I execute it, all the entries are returned as NULL.

I use the following to create a custom schema:

from pyspark.sql.types import IntegerType, StructField, StructType, TimestampType

# Python syntax: a list of fields, instantiated types, and True (not true).
customSchema = StructType([
    StructField("Customer", IntegerType(), True),
    StructField("TransDate", TimestampType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("Cost", IntegerType(), True),
    StructField("TransKey", IntegerType(), True),
])

and then read in the CSV with:

myData = spark.read.load('myData.csv', format="csv", header="true", sep=',', schema=customSchema)

which returns:

+--------+---------+--------+----+--------+
|Customer|TransDate|Quantity|Cost|Transkey|
+--------+---------+--------+----+--------+
|    null|     null|    null|null|    null|
+--------+---------+--------+----+--------+

Am I missing a crucial step? I suspect that the date column is the root of the problem. Note: I am running this in Google Colab.

Recommended answer

Here you go! Sample file contents:

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000

PATH_TO_FILE="file:///u/vikrant/LocalTestDateFile"

Loading the above file into a DataFrame:

df = spark.read.format("com.databricks.spark.csv") \
  .option("mode", "DROPMALFORMED") \
  .option("header", "true") \
  .option("inferschema", "true") \
  .option("delimiter", ",").load(PATH_TO_FILE)

Your date will be loaded as a string column, but as soon as you cast it to a date type, values in this date format become NULL.

from pyspark.sql.functions import col

df = df.withColumn('TransDate', col('TransDate').cast('date'))

+--------+---------+--------+-----------+----+---------+--------+
|Customer|TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+---------+--------+-----------+----+---------+--------+
|  149332|     null|       1|     199.95| 107|127998739|  100000|
+--------+---------+--------+-----------+----+---------+--------+

So we need to change the date format from dd.mm.yyyy to yyyy-mm-dd.

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

Python function to change the date format:

change_dateformat_func = udf(lambda x: datetime.strptime(x, '%d.%m.%Y').strftime('%Y-%m-%d'))
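
The conversion performed inside the UDF can be tried in plain Python first, without a Spark session. This is only a sketch: the helper name `to_iso_date` and its handling of missing values are illustrative assumptions, not part of the original answer.

```python
from datetime import datetime
from typing import Optional


def to_iso_date(value: Optional[str]) -> Optional[str]:
    """Convert 'dd.mm.yyyy' text to 'yyyy-mm-dd' text.

    None and empty strings are returned as None (an assumption made here
    so that missing values do not raise inside strptime).
    """
    if not value:
        return None
    return datetime.strptime(value, "%d.%m.%Y").strftime("%Y-%m-%d")


print(to_iso_date("15.11.2005"))
print(to_iso_date(None))
```

A function like this could be wrapped with `udf(...)` in the same way as the lambda, which may help if the column contains empty values.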

Now call this function on your DataFrame column:

newdf = df.withColumn('TransDate', change_dateformat_func(col('TransDate')).cast(DateType()))

+--------+----------+--------+-----------+----+---------+--------+
|Customer| TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+----------+--------+-----------+----+---------+--------+
|  149332|2005-11-15|       1|     199.95| 107|127998739|  100000|
+--------+----------+--------+-----------+----+---------+--------+

And here is the resulting schema:

 |-- Customer: integer (nullable = true)
 |-- TransDate: date (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- PurchAmount: double (nullable = true)
 |-- Cost: integer (nullable = true)
 |-- TransID: integer (nullable = true)
 |-- TransKey: integer (nullable = true)

Let me know if it works for you.
