Handling extra newlines in csv files parsed with Scala?
Question
I'm totally new to Scala, and am trying to parse a CSV file in which some cells (i.e. within double quotes) contain carriage returns, newlines, and other special characters such as commas, for example:
"A","B","C\n,FF\n","D"\n
"Q","W","E","R\n\n"\n
"1","2\n","2","2,2\n"\n
I want to load this into a list of lists in Scala, like the following:
List(List("A","B","C,FF","D"),List("Q","W","E","R"),List("1","2","2","2,2"))
Any suggestions on how this can be done?
I have found some solutions for the same problem in other languages. For example, this is a great one in Python, which I understand well: Handling extra newlines (carriage returns) in csv files parsed with Python?
My attempt:
import scala.io.Source
val src2 = Source.fromFile("sourceFileName.csv")
val it = src2.getLines()
val data = for (i <- it) yield i.replace("\"", "").split(",")
But it looks like all carriage returns are seen as new lines.
Recommended answer
It seems to me that if the actual cells contain newlines, then you'll need to keep some state while traversing getLines. You can do this using a foldLeft or similar operator. If the file is small enough, you can also use mkString to get the whole file as a string in memory and then operate on that. The following simplified version assumes that every cell is surrounded by quotes (and, as written, that no cell is empty, since an empty cell would itself appear as two quotes in a row). For example:
import scala.io.Source
val converted = Source.fromFile(sourceFileName).mkString.replaceAll("[\r\n]", "").replaceAll("\"\"", "\"\n\"")
First, we're removing all new lines. Then, the true new lines will manifest as two quotes in a row (since otherwise there would be a comma separating the quotes), so we add back the new lines between the quotes. Then we should have a normalized version of the file, and we can proceed with simple operations:
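// Split each row on "," (quote, comma, quote) so that commas inside a cell are preserved.
converted.split("\n").toList.map(_.split("\",\"").toList.map(_.replaceAll("\"", "")))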
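The stateful alternative mentioned at the start of the answer (keeping some state in a foldLeft) could look roughly like the sketch below. It is only an illustration under stated assumptions, not a full CSV parser: the names QuotedCsv, State and parse are made up for this example, line breaks inside quoted cells are simply dropped, doubled quotes used as an escape are not preserved, and rows that are completely empty are skipped.

```scala
object QuotedCsv {
  // Hypothetical state carried through the fold: finished rows, the row being
  // built, the cell being built, and whether we are currently inside quotes.
  private final case class State(
      rows: Vector[List[String]] = Vector.empty,
      row: Vector[String] = Vector.empty,
      cell: String = "",
      inQuotes: Boolean = false
  )

  def parse(text: String): List[List[String]] = {
    val end = text.foldLeft(State()) { (s, ch) =>
      ch match {
        // A double quote toggles the "inside a cell" flag and is not kept.
        case '"' => s.copy(inQuotes = !s.inQuotes)
        // Line breaks inside a quoted cell are dropped.
        case '\n' | '\r' if s.inQuotes => s
        // A comma outside quotes ends the current cell.
        case ',' if !s.inQuotes => s.copy(row = s.row :+ s.cell, cell = "")
        // A newline outside quotes ends the current row (blank rows are skipped).
        case '\n' =>
          if (s.row.isEmpty && s.cell.isEmpty) s
          else State(rows = s.rows :+ (s.row :+ s.cell).toList)
        // A carriage return outside quotes is ignored (handles \r\n endings).
        case '\r' => s
        // Anything else, including a comma inside quotes, belongs to the cell.
        case other => s.copy(cell = s.cell + other)
      }
    }
    // Flush the last row if the input does not end with a newline.
    val rows =
      if (end.row.nonEmpty || end.cell.nonEmpty) end.rows :+ (end.row :+ end.cell).toList
      else end.rows
    rows.toList
  }

  def main(args: Array[String]): Unit = {
    // The sample from the question, written as a Scala string literal.
    val sample =
      "\"A\",\"B\",\"C\n,FF\n\",\"D\"\n" +
      "\"Q\",\"W\",\"E\",\"R\n\n\"\n" +
      "\"1\",\"2\n\",\"2\",\"2,2\n\"\n"
    println(parse(sample))
  }
}
```

For a real file, the text could be read with scala.io.Source.fromFile(sourceFileName).mkString and passed to parse. Unlike the normalization approach, this version does not depend on two adjacent quotes marking a row boundary, but for anything beyond simple inputs a dedicated CSV library is probably the safer choice.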