How do I prevent pyspark from interpreting commas as a delimiter in a CSV field that has a JSON object as its value

Question

I am trying to read a comma-delimited CSV file using pyspark version 2.4.5 and Databricks' spark-csv module. One of the fields in the CSV file has a JSON object as its value. The contents of the CSV are as follows:

header_col_1, header_col_2, header_col_3
one, two, three
one, {"key1":"value1","key2":"value2","key3":"value3","key4":"value4"}, three

Other solutions that I found defined read options such as "escape": '"' and 'delimiter': ','. These do not seem to work here, because the commas in the field in question are not enclosed in double quotes. Below is the source code that I am using to read the CSV file:

import findspark

# findspark.init() must run before pyspark is imported so that the local
# Spark installation is placed on sys.path
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

read_options = {
    'header': 'true',
    'escape': '"',
    'delimiter': ',',
    'inferSchema': 'false',
}

spark_df = spark.read.format('com.databricks.spark.csv').options(**read_options).load('test.csv')

# show() prints the DataFrame and returns None, so wrapping it in print() is unnecessary
spark_df.show()

The output of the above program is shown below:

+------------+-----------------+---------------+
|header_col_1|     header_col_2|   header_col_3|
+------------+-----------------+---------------+
|         one|              two|          three|
|         one| {"key1":"value1"|"key2":"value2"|
+------------+-----------------+---------------+

Answer

In the CSV file, you have to put the JSON string in straight double quotes, and the double quotes inside the JSON string must be escaped with backslashes (\"). Remove your escape option, as it is incorrect: by default, the delimiter is set to ',', the escape character to '\', and the quote character to '"'. Refer to the Databricks documentation.
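Below is a minimal sketch of that fix, assuming the corrected data is written to test_fixed.csv (a file name chosen here for illustration). The JSON field is wrapped in double quotes, its inner quotes are backslash-escaped, and the file is read with Spark's default delimiter, quote, and escape characters:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test').getOrCreate()

# Corrected CSV: the JSON field is enclosed in double quotes and its inner
# double quotes are escaped with backslashes, matching Spark's defaults
# (delimiter ',', quote '"', escape '\')
csv_content = (
    'header_col_1,header_col_2,header_col_3\n'
    'one,two,three\n'
    'one,"{\\"key1\\":\\"value1\\",\\"key2\\":\\"value2\\",'
    '\\"key3\\":\\"value3\\",\\"key4\\":\\"value4\\"}",three\n'
)

# test_fixed.csv is a hypothetical file name used only for this sketch
with open('test_fixed.csv', 'w') as f:
    f.write(csv_content)

# No explicit 'escape' option is passed: the defaults already handle \" inside a quoted field
spark_df = spark.read.options(header='true', inferSchema='false').csv('test_fixed.csv')

# header_col_2 should now hold the complete JSON string as a single value
spark_df.show(truncate=False)

With the file in this form, the commas inside the JSON object are no longer treated as field delimiters, and the whole object lands in header_col_2.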
