Handling extra newlines in CSV files parsed with Scala?


Question

I'm totally new to Scala and am trying to parse a CSV file in which some cells (that is, the text inside double quotes) contain carriage returns, newlines, and other special characters such as commas. For example:

"A","B","C\n,FF\n","D"\n
"Q","W","E","R\n\n"\n
"1","2\n","2","2,2\n"\n

I want to load this into a list of lists in Scala, like the following:

List(List("A","B","C,FF","D"),List("Q","W","E","R"),List("1","2","2","2,2"))

Any suggestions on how this can be done?

I have found solutions to the same problem in other languages. For example, this is a great one for Python, which I understand well: Handling extra newlines (carriage returns) in csv files parsed with Python?

My attempt:

import scala.io.Source

val src2 = Source.fromFile("sourceFileName.csv")
val it = src2.getLines()
// Strip every quote, then split each physical line on commas
val data = for (i <- it) yield i.replace("\"", "").split(",")

But it looks like every carriage return, including those inside quoted cells, is treated as the start of a new line.

Recommended answer

It seems to me that if the cells themselves contain newlines, you'll need to keep some state while traversing getLines. You can do this with foldLeft or a similar operator. If the file is small enough, you can also use mkString to read the whole file into memory as a single string and then operate on that. The following simplified version assumes that every cell is surrounded by quotes. For example:

// Drop every CR/LF, then put a newline back wherever two quotes touch (a row boundary)
val converted = Source.fromFile(sourceFileName).mkString.replaceAll("[\r\n]", "").replaceAll("\"\"", "\"\n\"")

First, we remove all newlines. The true row boundaries then show up as two quotes in a row (anywhere else, a comma separates the quotes), so we add a newline back between those quotes. Note that an empty cell ("") would also look like two quotes in a row, so this shortcut assumes there are none. That gives us a normalized version of the file, and we can proceed with simple operations:

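// Strip each row's outer quotes, then split on the "," between cells so commas inside a cell survive
converted.split("\n").map(_.stripPrefix("\"").stripSuffix("\"").split("\",\""))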
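
The stateful approach mentioned at the start of this answer (keeping track of whether you are inside quotes while folding over the input) could look roughly like the sketch below. It is only an illustration under some assumptions that are not part of the original answer: the names CsvSketch, State and parseCsv are made up, line breaks inside a cell are dropped rather than preserved (to match the output the question asks for), and escaped quotes ("" inside a cell) are not supported.

```scala
object CsvSketch {
  // Hypothetical parser state carried through the fold.
  final case class State(
      rows: Vector[List[String]] = Vector.empty, // completed rows
      cells: Vector[String] = Vector.empty,      // completed cells of the current row
      cell: String = "",                         // text of the cell being read
      inQuotes: Boolean = false                  // true between an opening and a closing quote
  ) {
    // Close the current cell and start a new one in the same row.
    def endCell: State = copy(cells = cells :+ cell, cell = "")

    // Close the current row; completely blank lines are ignored.
    def endRow: State =
      if (cells.isEmpty && cell.isEmpty) this
      else copy(rows = rows :+ (cells :+ cell).toList, cells = Vector.empty, cell = "")
  }

  // Fold over the characters of the whole file content.
  def parseCsv(text: String): List[List[String]] = {
    val end = text.foldLeft(State()) { (s, ch) =>
      ch match {
        case '"'                 => s.copy(inQuotes = !s.inQuotes) // toggle state, drop the quote
        case ',' if !s.inQuotes  => s.endCell                      // separator between cells
        case '\n' if !s.inQuotes => s.endRow                       // a real end of record
        case '\r' | '\n'         => s                              // break inside a cell, or stray CR: drop
        case other               => s.copy(cell = s.cell + other)  // ordinary character (incl. quoted commas)
      }
    }
    // Flush the last row in case the file does not end with a newline.
    end.endRow.rows.toList
  }
}
```

With the sample from the question, CsvSketch.parseCsv(Source.fromFile(sourceFileName).mkString) should give List(List("A","B","C,FF","D"), List("Q","W","E","R"), List("1","2","2","2,2")). Unlike the replaceAll shortcut, this version does not mistake an empty cell in the middle of a row for a row boundary, but it still reads the whole file into memory and builds cells by string concatenation, so it is meant for small files.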
