Parsing a CSV file in PySpark using Spark built-in functions or methods


Question

I am using Spark 2.3 and working on a proof of concept in which I have to load a set of CSV files into a Spark DataFrame.

Consider the CSV below as a sample that I need to parse and load into a DataFrame. It contains several bad records, which need to be identified.

id,name,age,loaded_date,sex
1,ABC,32,2019-09-11,M
2,,33,2019-09-11,M
3,XYZ,35,2019-08-11,M
4,PQR,32,2019-30-10,M   #invalid date
5,EFG,32,               #missing other column details
6,DEF,32,2019/09/11,M   #invalid date format
7,XYZ,32,2017-01-01,9   #last column has to be character only
8,KLM,XX,2017-01-01,F
9,ABC,3.2,2019-10-10,M  #decimal value for integer data type
10,ABC,32,2019-02-29,M  #invalid date

This would have been an easy task if I could parse it with plain Python or pandas functions.
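For comparison, the row-level checks the question has in mind could be sketched in plain Python roughly as follows. This is only an illustration: the helper name `validate_row` and the exact rules (integer age, `yyyy-MM-dd` date, sex restricted to M or F) are assumptions read off the comments in the sample, and the sample is shortened and stripped of its trailing comments.

```python
import csv
import io
from datetime import datetime

# Shortened copy of the sample, without the trailing "#..." comments.
SAMPLE = """id,name,age,loaded_date,sex
1,ABC,32,2019-09-11,M
4,PQR,32,2019-30-10,M
5,EFG,32,
6,DEF,32,2019/09/11,M
7,XYZ,32,2017-01-01,9
8,KLM,XX,2017-01-01,F
9,ABC,3.2,2019-10-10,M
10,ABC,32,2019-02-29,M
"""


def validate_row(row, n_columns=5):
    """Return a list of problems found in one CSV row (empty list = valid)."""
    if len(row) != n_columns:
        return ["wrong number of columns"]
    _id, _name, age, loaded_date, sex = row
    problems = []
    if not age.isdecimal():
        problems.append("age is not an integer")
    try:
        # strptime is strict: month 30 or 29 February 2019 raises ValueError.
        datetime.strptime(loaded_date, "%Y-%m-%d")
    except ValueError:
        problems.append("invalid loaded_date")
    if sex not in ("M", "F"):
        problems.append("sex is not M or F")
    return problems


rows = list(csv.reader(io.StringIO(SAMPLE)))[1:]  # skip the header row
bad = {row[0]: validate_row(row) for row in rows if validate_row(row)}
for record_id, problems in bad.items():
    print(record_id, problems)
```

Only record 1 of this shortened sample passes all checks; every other record is reported with at least one problem.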

This is how I defined the schema and loaded the file:

from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType, DateType)

# The extra corrupt_record column receives the raw text of rows that fail to parse.
schema = StructType([
    StructField("id",             IntegerType(), True),
    StructField("name",           StringType(),  True),
    StructField("age",            IntegerType(), True),
    StructField("loaded_date",    DateType(),    True),
    StructField("sex",            StringType(),  True),
    StructField("corrupt_record", StringType(),  True)])

# `file` holds the path to the CSV file.
df = (spark.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("dateFormat", "yyyy-MM-dd")
      .option("nanValue", "0")
      .option("nullValue", " ")
      .option("treatEmptyValuesAsNulls", "false")
      .option("columnNameOfCorruptRecord", "corrupt_record")
      .schema(schema)
      .load(file))

>>> df.show(truncate=False)
+----+----+----+-----------+----+----------------------+
|id  |name|age |loaded_date|sex |corrupt_record        |
+----+----+----+-----------+----+----------------------+
|1   |ABC |32  |2019-09-11 |M   |null                  |
|2   |null|33  |2019-09-11 |M   |null                  |
|3   |XYZ |35  |2019-08-11 |M   |null                  |
|4   |PQR |32  |2021-06-10 |M   |null                  |
|5   |EFG |32  |null       |null|5,EFG,32,             |
|null|null|null|null       |null|6,DEF,32,2019/09/11,M |
|7   |XYZ |32  |2017-01-01 |9   |null                  |
|null|null|null|null       |null|8,KLM,XX,2017-01-01,F |
|null|null|null|null       |null|9,ABC,3.2,2019-10-10,M|
|10  |ABC |32  |2019-03-01 |M   |null                  |
+----+----+----+-----------+----+----------------------+

The code above parsed many records as expected, but it failed to catch the invalid dates. See records 4 and 10: they were converted to junk dates.
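The two junk dates in that output are consistent with lenient date parsing, where out-of-range fields roll over into the next unit instead of being rejected. The plain-Python sketch below reproduces that arithmetic. It assumes the CSV reader in this Spark version parses dates leniently (not confirmed here), and `lenient_date` is a hypothetical helper, not a Spark function.

```python
from datetime import date, timedelta


def lenient_date(year, month, day):
    """Roll out-of-range months and days forward instead of rejecting them."""
    # Months beyond 12 spill over into the following years.
    year += (month - 1) // 12
    month = (month - 1) % 12 + 1
    # Days beyond the end of the month spill over into the following months.
    return date(year, month, 1) + timedelta(days=day - 1)


print(lenient_date(2019, 30, 10))  # record 4: "2019-30-10" -> 2021-06-10
print(lenient_date(2019, 2, 29))   # record 10: "2019-02-29" -> 2019-03-01
```

Both results match the values shown for records 4 and 10, which suggests the dates were rolled over rather than validated.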

I could load the dates as strings and then use a UDF or a cast to parse them and check whether each date is valid. Is there a way to detect invalid dates at load time, without a custom UDF or a separate validation step later in the code?

I am also looking for a way to handle record 7, which has a numeric value in the last column.

Recommended answer

First of all, load the data without any prespecified schema, as @AndrzejS also did:

df = spark.read.option("header", "true").csv("data/yourdata.csv")
df.show()
+---+----+---+-----------+----+
| id|name|age|loaded_date| sex|
+---+----+---+-----------+----+
|  1| ABC| 32| 2019-09-11|   M|
|  2|null| 33| 2019-09-11|   M|
|  3| XYZ| 35| 2019-08-11|   M|
|  4| PQR| 32| 2019-30-10|   M|
|  5| EFG| 32|       null|null|
|  6| DEF| 32| 2019/09/11|   M|
|  7| XYZ| 32| 2017-01-01|   9|
|  8| KLM| XX| 2017-01-01|   F|
|  9| ABC|3.2| 2019-10-10|   M|
| 10| ABC| 32| 2019-02-29|   M|
+---+----+---+-----------+----+

Next, we need to determine which values do not fit the intended type of their column. For example, XX or 3.2 cannot be an age, so these values need to be marked as null. We test whether the value is an integer. Similarly, we test whether loaded_date is really a date, and finally we check whether sex is either F or M. Please refer to my previous post on these tests.

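from pyspark.sql.functions import col, when

df = df.select('id', 'name',
               # true only if age casts to int and the cast loses nothing (rejects 3.2)
               'age', (col('age').cast('int').isNotNull() & (col('age').cast('int') - col('age') == 0)).alias('ageInt'),
               # true only if loaded_date casts to a valid date
               'loaded_date', (col('loaded_date').cast('date').isNotNull()).alias('loaded_dateDate'),
               'sex'
              )
df.show()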
+---+----+---+------+-----------+---------------+----+
| id|name|age|ageInt|loaded_date|loaded_dateDate| sex|
+---+----+---+------+-----------+---------------+----+
|  1| ABC| 32|  true| 2019-09-11|           true|   M|
|  2|null| 33|  true| 2019-09-11|           true|   M|
|  3| XYZ| 35|  true| 2019-08-11|           true|   M|
|  4| PQR| 32|  true| 2019-30-10|          false|   M|
|  5| EFG| 32|  true|       null|          false|null|
|  6| DEF| 32|  true| 2019/09/11|          false|   M|
|  7| XYZ| 32|  true| 2017-01-01|           true|   9|
|  8| KLM| XX| false| 2017-01-01|           true|   F|
|  9| ABC|3.2| false| 2019-10-10|           true|   M|
| 10| ABC| 32|  true| 2019-02-29|          false|   M|
+---+----+---+------+-----------+---------------+----+

Finally, use if/else, which in PySpark is when/otherwise, to mark the invalid values as null.

df = df.withColumn('age',when(col('ageInt')==True,col('age')).otherwise(None))\
       .withColumn('loaded_date',when(col('loaded_dateDate')==True,col('loaded_date')).otherwise(None))\
       .withColumn('sex',when(col('sex').isin('M','F'),col('sex')).otherwise(None))\
       .drop('ageInt','loaded_dateDate')
df.show()
+---+----+----+-----------+----+
| id|name| age|loaded_date| sex|
+---+----+----+-----------+----+
|  1| ABC|  32| 2019-09-11|   M|
|  2|null|  33| 2019-09-11|   M|
|  3| XYZ|  35| 2019-08-11|   M|
|  4| PQR|  32|       null|   M|
|  5| EFG|  32|       null|null|
|  6| DEF|  32|       null|   M|
|  7| XYZ|  32| 2017-01-01|null|
|  8| KLM|null| 2017-01-01|   F|
|  9| ABC|null| 2019-10-10|   M|
| 10| ABC|  32|       null|   M|
+---+----+----+-----------+----+
