How to handle multi line rows in spark?

Question

I have a dataframe with some multi-line observations:

+--------------------+----------------+
|                col1|            col2|
+--------------------+----------------+
|something1          |somethingelse1  |
|something2          |somethingelse2  |
|something3          |somethingelse3  |
|something4          |somethingelse4  |
|multiline

 row                 |somethings      |
|something           |somethingall    |

What I want is to save this dataframe in csv (or txt) format, using the following:

df
  .write
  .format("csv")
  .save("s3://../adf/")

But when I check the file, it separates the observations into multiple lines. What I want is for the lines with 'multiline' observations to be on one and the same row in the txt/csv file. I tried to save it as a txt file:

df
  .as[(String,String)]
  .rdd
  .saveAsTextFile("s3://../adf")

But the same output was observed.

I can imagine that one way is to replace \n with something else and then, when loading it back, apply the reverse transformation. But is there a way to save it in the desired way without doing any kind of transformation to the data?
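
For reference, a minimal sketch of that workaround in Scala (assuming spark is the active SparkSession and df is the dataframe above; the two-character \n token is an arbitrary choice, anything absent from the data would do):

import org.apache.spark.sql.functions.{col, regexp_replace}

// Escape: turn each embedded newline into the literal two-character token \n.
val escaped = df.columns.foldLeft(df) { (d, c) =>
  d.withColumn(c, regexp_replace(col(c), "\n", "\\\\n"))
}
escaped.write.format("csv").save("s3://../adf/")

// Restore: reverse the substitution after loading the data back.
val loaded = spark.read.csv("s3://../adf/")
val restored = loaded.columns.foldLeft(loaded) { (d, c) =>
  d.withColumn(c, regexp_replace(col(c), "\\\\n", "\n"))
}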

Answer

Assuming the multi-line data is properly quoted, you can parse multi-line CSV data using the univocity parser and the multiLine setting:

sparkSession.read
  .option("parserLib", "univocity")
  .option("multiLine", "true")
  .csv(file)
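
For a complete round trip, a sketch (assuming Spark 2.2+, where the built-in CSV source supports multiLine; the writer should quote any field containing the line separator, which is what makes the quoted read-back possible):

// Write: fields containing newlines are quoted, so each multi-line
// observation remains a single logical CSV record.
df.write
  .format("csv")
  .save("s3://../adf/")

// Read back: multiLine lets quoted fields span physical lines.
val restored = spark.read
  .option("multiLine", "true")
  .csv("s3://../adf/")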

Note that this requires reading the entire file onto a single executor, and it may not work if your data is too large. The standard text-file reader splits the file by lines before doing any other parsing, which prevents you from working with records that contain newlines unless there is a different record delimiter you can use. If there isn't, you may need to implement a custom TextInputFormat to handle multi-line records.
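
If a different record delimiter is available, the stock Hadoop TextInputFormat already honours the textinputformat.record.delimiter setting, so no custom input format is needed. A sketch, assuming a hypothetical ||| delimiter that never occurs inside the data:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Hypothetical record delimiter; pick one that cannot appear in your data.
val conf = new Configuration(spark.sparkContext.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "|||")

// Each RDD element is now one complete (possibly multi-line) record.
val records = spark.sparkContext
  .newAPIHadoopFile("s3://../adf/", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }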
