How to parse a file with newline character, escaped with \ and not quoted


Problem description

I am facing an issue when reading and parsing a CSV file. Some records have a newline symbol, "escaped" by a \, and the record is not quoted. The file might look like this:

Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;

I have tried reading it with sc.textFile("file.csv") and with sqlContext.read.format("..databricks..").option("escape/delimiter/...").load("file.csv")

However, no matter how I read it, a record/line/row is created when "\ \n" is reached. So, instead of getting 2 records from the file above, I am getting three:

[Line1field1,Line1field2.1,null] (3 fields)
[Line1field2.2,Line1field3,null] (3 fields)
[Line2FIeld1,Line2field2,Line2field3;] (3 fields)

The expected result is:

[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)
[Line2FIeld1,Line2field2,Line2field3] (3 fields)

(How the newline symbol is saved in the record is not that important; the main issue is having the correct set of records/lines.)

Any ideas of how to do that? Without modifying the original file, and preferably without any post/re-processing (for example, reading the file, filtering any lines with fewer fields than expected and then concatenating them could be a solution, but not at all optimal; a rough sketch of that workaround follows).
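For completeness, that non-optimal workaround might look roughly like this. It is only a local sketch, not a Spark job: it keys on the trailing \ rather than on field counts, reads the file through scala.io.Source, and assumes the file fits in memory.

import scala.io.Source

// Merge each \-terminated line with the line that follows it,
// then split the repaired records into fields.
val merged = Source.fromFile("file.csv").getLines()
  .foldLeft(Vector.empty[String]) { (acc, line) =>
    if (acc.nonEmpty && acc.last.endsWith("\\"))
      acc.init :+ (acc.last.dropRight(1) + line) // continuation of the previous record
    else
      acc :+ line                                // a fresh record
  }
val records = merged.map(_.stripSuffix(";").split(";"))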

My hope was to use Databricks' CSV parser and set the escape character to \ (which is supposed to be the default), but that didn't work [I got an error saying java.io.IOException: EOF whilst processing escape sequence].
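For reference, the two read attempts presumably looked something like this (a sketch; the format string and option names follow the spark-csv package's documented options, since the exact values of the original call are not shown above):

// Plain text read: Spark creates a record at every \n, escaped or not
val raw = sc.textFile("file.csv")

// spark-csv read; setting escape to \ is what produced the
// "EOF whilst processing escape sequence" error mentioned above
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ";")
  .option("escape", "\\")
  .load("file.csv")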

Should I somehow extend the parser and edit something, creating my own parser? What would be the best solution?

Thanks!

EDIT: Forgot to mention, I'm using Spark 1.6.

Recommended answer

The wholeTextFiles API should come to the rescue in your case. It reads each file as a key-value pair: the key is the path of the file and the value is the whole text of the file. You will have to do some replacements and splits to get the desired output, though:

val rdd = sparkSession.sparkContext.wholeTextFiles("path to the file") // on Spark 1.6, call sc.wholeTextFiles directly (SparkSession exists only from Spark 2.0)
                .flatMap(x => x._2.replace("\\\n", "")  // glue the \-escaped line breaks back together
                                  .replace(";\n", "\n") // drop the trailing ; at each record end
                                  .split("\n"))         // split the text into records
                .map(x => x.split(";"))                 // split each record into fields

The rdd output is:

[Line1field1,Line1field2.1 Line1field2.2,Line1field3]
[Line2FIeld1,Line2field2,Line2field3]
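A quick way to verify that (a usage sketch, assuming the rdd value above):

rdd.collect().foreach(fields => println(fields.mkString("[", ",", "]")))

One caveat on the design: wholeTextFiles materializes each file as a single value, so this approach works for files that fit in an executor's memory but will not scale to very large inputs.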
