How to read csv file with additional comma in quotes using pyspark?

Question

I am having some troubles reading the following CSV data in UTF-16:

FullName, FullLabel, Type
TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,

As far as I understand, it should not be a problem for a reader, since there is a quote parameter to handle that.

df = spark.read.csv(file_path, header=True, encoding='UTF-16', quote = '"')

However, this still gives me an incorrect split: the quoted field is broken apart at the comma inside the parentheses.

Is there some way to handle those cases or do I need to work it around with RDD?

Thanks in advance.

Answer

You can read the file as text using spark.read.text and split the values with a regex that splits on commas but ignores those inside quotes (see this post), then take the corresponding columns from the resulting array:

from pyspark.sql import functions as F

df = spark.read.text(file_path)

# Drop the header row, split each line on commas that fall outside quotes
# (the lookahead matches a comma followed by an even number of quotes up to
# the end of the line), then map the array elements to named columns.
df = df.filter("value != 'FullName, FullLabel, Type'") \
    .withColumn(
        "value",
        F.split(F.col("value"), ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
    ).select(
        F.col("value")[0].alias("FullName"),
        F.col("value")[1].alias("FullLabel"),
        F.col("value")[2].alias("Type")
    )

df.show(truncate=False)

#+----------+--------------------------------------------------+-----+
#|FullName  |FullLabel                                         |Type |
#+----------+--------------------------------------------------+-----+
#|TEST.slice| "Consideration":"Verde (Spar Verde, Fonte Verde)"| Test|
#+----------+--------------------------------------------------+-----+
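Note that the split keeps the space that followed each comma, so the values come out with leading whitespace (and FullLabel keeps its surrounding quotes, since those are part of the field content). If that is unwanted, trimming every column afterwards is one option; a minimal sketch over the df built above:

# Strip the leading/trailing whitespace left over from the comma split.
# (This does not touch the quotes inside FullLabel, which are field content.)
df = df.select([F.trim(F.col(c)).alias(c) for c in df.columns])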

Update:

For an input file in UTF-16, you can replace spark.read.text by loading the file with binaryFiles and converting the resulting RDD into a dataframe:

# Read the raw bytes, decode them as UTF-16, and emit one row per line.
df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").split("\n")]) \
    .toDF(["value"])
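One caveat, as an aside: if the file has Windows line endings, split("\n") leaves a trailing \r on every line; Python's str.splitlines() handles both endings. A minimal variant:

# Same idea, but splitlines() copes with both \n and \r\n line endings.
df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").splitlines()]) \
    .toDF(["value"])

From here, the same header filter and quote-aware split shown above apply unchanged.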
