Reading csv files with quoted fields containing embedded commas


Problem description

I am reading a csv file in Pyspark as follows:

df_raw=spark.read.option("header","true").csv(csv_path)

However, the data file has quoted fields with embedded commas in them, which should not be treated as field separators. How can I handle this in Pyspark? I know pandas can handle this, but can Spark? The version I am using is Spark 2.0.0.

Here is an example which works in Pandas but fails using Spark:

In [1]: import pandas as pd

In [2]: pdf = pd.read_csv('malformed_data.csv')

In [3]: sdf=spark.read.format("org.apache.spark.csv").csv('malformed_data.csv',header=True)

In [4]: pdf[['col12','col13','col14']]
Out[4]:
                    col12                                             col13  \
0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
1                     NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC

   col14
0   23.0
1    0.0

In [5]: sdf.select("col12","col13",'col14').show()
+------------------+--------------------+--------------------+
|             col12|               col13|               col14|
+------------------+--------------------+--------------------+
|"32 XIY ""W""   JK|              RE LK"|SOMETHINGLIKEAPHE...|
|              null|OUTKAST#THROOTS~W...|                 0.0|
+------------------+--------------------+--------------------+

Contents of the file:

    col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE

Recommended answer

I noticed that your problematic line has escaping that uses double quotes themselves:

"32 XIY""W""JK,RE LK"

"32 XIY ""W"" JK, RE LK"

which should translate to:

32 XIY"W" JK,RE LK

32 XIY "W" JK, RE LK

According to RFC-4180, page 2:

  1. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote

That's what Excel does, for example, by default.
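
To see the doubling convention in action, here is a minimal sketch using Python's standard csv module (not part of the original answer), which follows RFC-4180 quoting by default:

import csv
import io

# The problematic field from the question, with the inner quotes doubled per RFC-4180
line = '"32 XIY ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE'
row = next(csv.reader(io.StringIO(line)))
print(row[0])    # 32 XIY "W"   JK, RE LK  -- the doubled quotes collapse to single quotes
print(len(row))  # 2 -- the comma inside the quoted field does not split it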

However, in Spark (as of Spark 2.1) escaping is done by default in a non-RFC way, using the backslash (\). To fix this you have to explicitly tell Spark to use the double quote as the escape character:

.option("quote", "\"")
.option("escape", "\"")

This may explain why the comma character wasn't interpreted as being inside a quoted column.

Options for the Spark csv format are not documented well on the Apache Spark site, but here's somewhat older documentation which I still find useful quite often:

https://github.com/databricks/spark-csv

Update Aug 2018: Spark 3.0 might change this behavior to be RFC-compliant. See SPARK-22236 for details.
