Dealing with commas within a field in a csv file using pyspark
Problem description
I have a csv data file containing commas within a column value. For example,
value_1,value_2,value_3
AAA_A,BBB,B,CCC_C
Here, the values are "AAA_A","BBB,B","CCC_C". But, when trying to split the line by comma, it is giving me 4 values, i.e. "AAA_A","BBB","B","CCC_C".
How to get the right values after splitting the line by commas in PySpark?
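To see why a plain split fails while a quote-aware parser succeeds, here is a minimal sketch using Python's standard csv module (outside Spark, purely for illustration). It assumes the embedded comma is protected by quotes in the raw line, e.g. AAA_A,"BBB,B",CCC_C; if the raw data has no quoting at all, no parser can tell the field-separating commas from the embedded one.

```python
import csv
import io

# Raw CSV line: the second field contains a comma and is quoted.
line = 'AAA_A,"BBB,B",CCC_C'

# Naive split on "," breaks the quoted field into two tokens.
naive = line.split(',')
print(naive)    # 4 tokens: ['AAA_A', '"BBB', 'B"', 'CCC_C']

# A quote-aware parser keeps "BBB,B" together as one field.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)   # 3 fields: ['AAA_A', 'BBB,B', 'CCC_C']
```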
Answer
Use the spark-csv package from Databricks.
Delimiters that appear between quotes (by default ") are ignored.
Example:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For more info, review https://github.com/databricks/spark-csv
If your quote character is (') instead of ("), you can configure it with this class.
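The effect of changing the quote character can be illustrated outside Spark with Python's standard csv module and its quotechar parameter (the spark-csv analogue would be setting the quote option on the reader; the data line below is a made-up example with single-quoted fields):

```python
import csv
import io

# Field wrapped in single quotes instead of the default double quotes.
line = "AAA_A,'BBB,B',CCC_C"

# Tell the parser that ' is the quote character.
parsed = next(csv.reader(io.StringIO(line), quotechar="'"))
print(parsed)   # ['AAA_A', 'BBB,B', 'CCC_C']
```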
For the Python API:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
Best regards.