Use Scala parser combinators to parse CSV files

Problem description

I'm trying to write a CSV parser using Scala parser combinators. The grammar is based on RFC4180. I came up with the following code. It almost works, but I cannot get it to correctly separate different records. What did I miss?

object CSV extends RegexParsers {
  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r
  
  def file: Parser[List[List[String]]] = ((record~((CRLF~>record)*))<~(CRLF?)) ^^ { 
    case r~rs => r::rs
  }
  def record: Parser[List[String]] = (field~((COMMA~>field)*)) ^^ {
    case f~fs => f::fs
  }
  def field: Parser[String] = escaped|nonescaped
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}


println(CSV.parse(""" "foo", "bar", 123""" + "\r\n" +
  "hello, world, 456" + "\r\n" +
  """ spam, 789, egg"""))

// Output: List(List(foo, bar, 123hello, world, 456spam, 789, egg)) 
// Expected: List(List(foo, bar, 123), List(hello, world, 456), List(spam, 789, egg))

Update: problem solved

By default, RegexParsers skips whitespace (spaces, tabs, carriage returns, and line feeds) before each token, using the regular expression \s+. That is why the parser above cannot separate records: the line breaks are consumed as whitespace before the CRLF separator is ever tried. The skipWhitespace mode needs to be disabled. Redefining whiteSpace as just [ \t] does not solve the problem, because it drops all spaces within fields (so "foo bar" in the CSV becomes "foobar"), which is undesired. The updated source of the parser is thus:

import scala.util.parsing.combinator._

// A CSV parser based on RFC4180
// https://www.rfc-editor.org/rfc/rfc4180

object CSV extends RegexParsers {
  override val skipWhitespace = false   // meaningful spaces in CSV

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }  // combine 2 dquotes into 1
  def CRLF    = "\r\n" | "\n"
  def TXT     = "[^\",\r\n]".r
  def SPACES  = "[ \t]+".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ (CRLF?)

  def record: Parser[List[String]] = repsep(field, COMMA)

  def field: Parser[String] = escaped|nonescaped


  def escaped: Parser[String] = {
    ((SPACES?)~>DQUOTE~>((TXT|COMMA|CRLF|DQUOTE2)*)<~DQUOTE<~(SPACES?)) ^^ { 
      case ls => ls.mkString("")
    }
  }

  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }



  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case e => throw new Exception(e.toString)
  }
}
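
The whitespace behaviour described in the update can be checked without the parser library. The following is a minimal sketch using only the Scala standard library; the object name WhitespaceDemo and the helper skipped are hypothetical names introduced for illustration. It measures how many leading characters each whitespace pattern would consume before a token is matched.

```scala
import scala.util.matching.Regex

// Illustration only: WhitespaceDemo and skipped are hypothetical names.
object WhitespaceDemo {
  // The default whiteSpace pattern of RegexParsers.
  val defaultWhiteSpace: Regex = """\s+""".r
  // A narrower pattern that matches spaces and tabs only.
  val blanksOnly: Regex = """[ \t]+""".r

  // Number of leading characters the pattern would skip at the start of input.
  def skipped(pattern: Regex, input: String): Int =
    pattern.findPrefixOf(input).map(_.length).getOrElse(0)

  def main(args: Array[String]): Unit = {
    val afterFirstRecord = "\r\nhello, world, 456"
    println(skipped(defaultWhiteSpace, afterFirstRecord)) // 2: the line break is swallowed
    println(skipped(blanksOnly, afterFirstRecord))        // 0: the line break is left for CRLF
    println(skipped(blanksOnly, " bar"))                  // 1: a space inside a field is still dropped
  }
}
```

The last call shows why narrowing whiteSpace is not enough on its own: TXT matches one character at a time, so a blank before any character of a field is skipped as well.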

Recommended answer

What you missed is whitespace. I threw in a couple bonus improvements.

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override protected val whiteSpace = """[ \t]""".r

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = (escaped|nonescaped)
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}
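
A possible way to try the grammar above is sketched below. It restates the same rules in a separate object so that the snippet compiles on its own, and it assumes the scala-parser-combinators module is on the classpath. The names CsvDemo and parseRows are hypothetical, and returning an Either instead of an empty list on failure is a choice made for this sketch, not part of the answer.

```scala
import scala.util.parsing.combinator.RegexParsers

// Illustration only: CsvDemo and parseRows are hypothetical names.
object CsvDemo extends RegexParsers {
  // Skip blanks before tokens, but never line breaks.
  override protected val whiteSpace = """[ \t]""".r

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = escaped | nonescaped
  def escaped: Parser[String] =
    (DQUOTE ~> rep(TXT | COMMA | CR | LF | DQUOTE2) <~ DQUOTE) ^^ (_.mkString)
  def nonescaped: Parser[String] = rep(TXT) ^^ (_.mkString)

  // Returns the parsed rows, or the parser's error message on failure.
  def parseRows(s: String): Either[String, List[List[String]]] =
    parseAll(file, s) match {
      case Success(rows, _) => Right(rows)
      case failure          => Left(failure.toString)
    }

  def main(args: Array[String]): Unit = {
    // Two records separated by CRLF.
    println(parseRows(" \"foo\", \"bar\", 123\r\nhello, world, 456"))
    // A field containing an inner space.
    println(parseRows("foo bar,baz"))
  }
}
```

The first call should come back as two separate records. The second call is a quick check of the caveat raised in the question's update: because blanks are skipped before every TXT character, the inner space of foo bar is expected to disappear.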
