How to read csv file with additional comma in quotes using pyspark?

Question

I am having some trouble reading the following CSV data in UTF-16:

FullName, FullLabel, Type
TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,

As far as I understand, this should not be a problem for the reader, since there is a quote parameter to handle that.

df = spark.read.csv(file_path, header=True, encoding='UTF-16', quote='"')

However, this still gives me an incorrect split: the quoted field is broken apart at the comma inside the parentheses.

Is there some way to handle those cases, or do I need to work around it with an RDD?

Thanks.

Answer

The quote option does not help here because the quotes do not enclose the whole field, so the CSV parser cannot treat the inner comma as escaped. Instead, you can read the file as plain text using spark.read.text, split each line with a regex that splits on commas but ignores those inside quotes (see this post), then pick the corresponding columns from the resulting array:

from pyspark.sql import functions as F

df = spark.read.text(file_path)

# Drop the header line, split on commas that sit outside quoted spans
# (i.e. commas followed by an even number of double quotes up to the
# end of the line), and map the array elements to named columns.
df = df.filter("value != 'FullName, FullLabel, Type'") \
    .withColumn(
        "value",
        F.split(F.col("value"), ',(?=(?:[^"]*"[^"]*")*[^"]*$)')
    ).select(
        F.col("value")[0].alias("FullName"),
        F.col("value")[1].alias("FullLabel"),
        F.col("value")[2].alias("Type")
    )

df.show(truncate=False)

#+----------+--------------------------------------------------+-----+
#|FullName  |FullLabel                                         |Type |
#+----------+--------------------------------------------------+-----+
#|TEST.slice| "Consideration":"Verde (Spar Verde, Fonte Verde)"| Test|
#+----------+--------------------------------------------------+-----+
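
As a quick sanity check, the same pattern can be tested outside Spark with Python's re module: the lookahead only allows a split when the comma is followed by an even number of double quotes up to the end of the line, i.e. when the comma sits outside any quoted span. A minimal standalone sketch using the sample row from the question:

import re

# Sample row from the question
line = 'TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,'

# Split only on commas followed by an even number of double quotes
parts = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', line)
print(parts)
# ['TEST.slice', ' "Consideration":"Verde (Spar Verde, Fonte Verde)"', ' Test', '']

The comma inside the parentheses is left alone, and the leading spaces survive the split, which is why they also show up in the dataframe output above (the trailing empty string comes from the trailing comma in the row).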

Update:

For an input file in UTF-16, you can replace spark.read.text by loading the file with binaryFiles and then converting the resulting RDD into a dataframe:

# Read the raw bytes, decode them as UTF-16, and emit one row per line
df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").split("\n")]) \
    .toDF(["value"])
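
Putting the two pieces together, a minimal end-to-end sketch might look like the following (assuming file_path points to the UTF-16 file and spark is an active SparkSession; the extra filter on empty strings is an addition to guard against blank lines produced by the newline split):

from pyspark.sql import functions as F

# Decode the UTF-16 bytes and emit one row per line
rdd = spark.sparkContext.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").split("\n")])

# Drop blank lines and the header, then split and name the columns
df = rdd.toDF(["value"]) \
    .filter("value != '' and value != 'FullName, FullLabel, Type'") \
    .withColumn("value", F.split("value", ',(?=(?:[^"]*"[^"]*")*[^"]*$)')) \
    .select(
        F.col("value")[0].alias("FullName"),
        F.col("value")[1].alias("FullLabel"),
        F.col("value")[2].alias("Type")
    )

df.show(truncate=False)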
