Dealing with commas within a field in a csv file using pyspark


Problem description


I have a csv data file containing commas within a column value. For example,

value_1,value_2,value_3  
AAA_A,BBB,B,CCC_C  


Here, the values are "AAA_A","BBB,B","CCC_C". But, when trying to split the line by comma, it is giving me 4 values, i.e. "AAA_A","BBB","B","CCC_C".


How to get the right values after splitting the line by commas in PySpark?
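The failure mode is easy to reproduce with a plain string split, which has no notion of quoting and breaks on every comma:

```python
# Naive splitting treats every comma as a delimiter, so the
# intended field "BBB,B" is torn into two separate values.
line = "AAA_A,BBB,B,CCC_C"
parts = line.split(",")
print(parts)  # → ['AAA_A', 'BBB', 'B', 'CCC_C'] -- 4 values instead of 3
```

A CSV-aware parser is needed to keep the embedded comma inside its field, which is what the answer below recommends.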

Recommended answer


Use the spark-csv class from Databricks.


Delimiters inside quotes (the double quote, ", by default) are ignored.
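The same quoting rule can be demonstrated with Python's standard csv module, a minimal sketch assuming the embedded comma is wrapped in double quotes in the file:

```python
import csv
import io

# A CSV-aware parser keeps the quoted field together instead of
# splitting on the comma inside "BBB,B".
data = io.StringIO('value_1,value_2,value_3\nAAA_A,"BBB,B",CCC_C\n')
rows = list(csv.reader(data))
print(rows[1])  # → ['AAA_A', 'BBB,B', 'CCC_C'] -- 3 values, as intended
```

spark-csv applies the same convention when parsing each line.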

Example:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")


For more info, review https://github.com/databricks/spark-csv


If your quote character is (') instead of ("), you can configure it with this class.
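For illustration, the same knob exists in Python's standard csv module as quotechar; this is a stdlib sketch of the idea, not the spark-csv API itself (in spark-csv the corresponding setting is the quote option):

```python
import csv
import io

# A file that quotes fields with single quotes instead of double quotes.
data = io.StringIO("AAA_A,'BBB,B',CCC_C\n")
rows = list(csv.reader(data, quotechar="'"))
print(rows[0])  # → ['AAA_A', 'BBB,B', 'CCC_C']
```

Without quotechar="'", the single quotes would be treated as ordinary characters and the line would again split into four values.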


For the Python API:

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('cars.csv')

Best regards.
