Is there any option to preserve line breaks within quotation marks when reading multiline CSV files in Spark?
Question
I have a CSV file with a line break within quotation marks in the third line (the first line is the CSV header).
data/testdata.csv
"id", "description"
"1", "some description"
"2", "other description with line
break"
Regardless of whether it is correct CSV or not, I must parse it into valid records. This is what I tried:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Main2 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[4]")
                .getOrCreate();

        Dataset<Row> rows = spark
                .read()
                .format("csv")
                .option("header", "true")
                .load("data/testdata.csv");

        rows.foreach(row -> System.out.println(row));
    }
}
The output is as follows:
[1, "some description"]
[2, "other description with line]
[break",null]
As you can see, Spark treats break"
as a new record and fills the missing column with null. The question is: does Spark's CSV parser have an option that allows such line breaks?
I tried the code below (reference) but it doesn't work:
Dataset<Row> rows = spark.read()
.option("parserLib", "univocity")
.option("multiLine", "true")
.csv("data/testdata.csv");
Answer
According to this article, parsing multiline CSV files is possible since Spark 2.2.0. In my case the settings below do the job (sep is ";" because my data is semicolon-separated; for a comma-separated file like the one in the question, the sep option can be omitted, since comma is the default):
sparkSession
.read()
.option("sep", ";")
.option("quote", "\"")
.option("multiLine", "true")
.option("ignoreLeadingWhiteSpace", true)
.csv(path.toString());
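To illustrate what the multiLine behavior amounts to, here is a minimal, hypothetical quote-aware record splitter in plain Java (this is a sketch of the idea, not Spark's actual implementation, and it deliberately ignores escaped "" pairs): a newline only ends a record when it falls outside double quotes.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplitter {
    // Split raw CSV text into records, treating newlines inside
    // double quotes as part of the field rather than as record breaks.
    static List<String> splitRecords(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : csv.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;   // naive toggle; ignores escaped "" pairs
                current.append(c);
            } else if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "\"id\", \"description\"\n"
                   + "\"1\", \"some description\"\n"
                   + "\"2\", \"other description with line\nbreak\"\n";
        List<String> records = splitRecords(csv);
        System.out.println(records.size()); // header + 2 data records
        System.out.println(records.get(2)); // third record keeps its embedded newline
    }
}
```

Run against the sample data from the question, this yields three records, with the embedded line break preserved inside the third one, which is exactly what Spark's multiLine option achieves at the parser level.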