Scala parser combinators and newline-delimited text

Question

I am writing a Scala parser combinator grammar that reads newline-delimited word lists, where lists are separated by one or more blank lines. Given the following string:

cat
mouse
horse

apple
orange
pear

I would like it to return List(List(cat, mouse, horse), List(apple, orange, pear)).

I wrote this basic grammar which treats word lists as newline-delimited words. Note that I had to override the default definition of whitespace.

import util.parsing.combinator.RegexParsers

object WordList extends RegexParsers {

    private val eol = sys.props("line.separator")

    override val whiteSpace = """[ \t]+""".r

    val list: Parser[List[String]] = repsep( """\w+""".r, eol)

    val lists: Parser[List[List[String]]] = repsep(list, eol)

    def main(args: Array[String]): Unit = {
        val s =
          """cat
            |mouse
            |horse
            |
            |apple
            |orange
            |pear""".stripMargin

        println(parseAll(lists, s))
    }
}

This incorrectly treats blank lines as empty word lists, i.e. it returns

[8.1] parsed: List(List(cat, mouse, horse), List(), List(apple, orange, pear))

(Note the empty list in the middle.)

I can put an optional end of line at the end of each list.

val list: Parser[List[String]] = repsep( """\w+""".r, eol) <~ opt(eol)

This handles the case where there is a single blank line between lists, but has the same problem with multiple blank lines.

I tried changing the lists definition to allow multiple end-of-line delimiters:

val lists:Parser[List[List[String]]] = repsep(list, rep(eol))

but this hangs on the above input.

What is the correct grammar that will handle multiple blank lines as delimiters?

Recommended answer

You should try setting skipWhitespace to false instead of redefining whiteSpace. The empty list appears because repsep doesn't consume the line break at the end of a list, so that line break is taken as the separator between lists and the blank line that follows parses as an empty list. Instead, you should parse the line break (or possibly the end of input) after each item:

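import scala.util.parsing.combinator.RegexParsers

object WordList extends RegexParsers {

  private val eoi = """\z""".r // end of input
  // Platform line separator; the input is assumed to use the same line endings.
  private val eol = sys.props("line.separator")
  private val separator = eoi | eol
  private val word = """\w+""".r

  // Line breaks are significant here, so do not skip whitespace between tokens.
  override val skipWhitespace = false

  // A list is a run of words, each followed by a line break or the end of input.
  val list: Parser[List[String]] = rep(word <~ separator)

  // Lists are separated by one or more additional line breaks (blank lines).
  val lists: Parser[List[List[String]]] = repsep(list, rep1(eol))

  def main(args: Array[String]): Unit = {
    val s =
      """cat
        |mouse
        |horse
        |
        |apple
        |orange
        |pear""".stripMargin

    println(parseAll(lists, s))
  }

}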
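
As a hedged variation on the grammar above, the sketch below requires every list to contain at least one word and every separator to consume at least one character. The hang in the question most likely comes from repsep(list, rep(eol)), where both the element and the separator can succeed without consuming input, so the repetition never terminates. The object name WordListTolerant, the blankLines parser, and the choice of matching line breaks with the regex \r?\n (rather than the platform line separator) are assumptions made for illustration, not part of the original answer.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical variant: tolerates leading/trailing line breaks and "\r\n" input.
object WordListTolerant extends RegexParsers {

  // Line breaks are significant in this grammar, so nothing is skipped implicitly.
  override val skipWhitespace = false

  // Matches "\n" or "\r\n" regardless of the platform running the parser.
  private val eol: Parser[String] = """\r?\n""".r
  private val word: Parser[String] = """\w+""".r

  // One or more words, one per line; the final line break is left unconsumed.
  val list: Parser[List[String]] = rep1sep(word, eol)

  // Two or more line breaks in a row, i.e. at least one blank line.
  private val blankLines: Parser[Any] = eol ~ rep1(eol)

  // Lists separated by blank lines, with optional line breaks before and after.
  val lists: Parser[List[List[String]]] =
    rep(eol) ~> repsep(list, blankLines) <~ rep(eol)

  def main(args: Array[String]): Unit = {
    val s = "cat\nmouse\nhorse\n\n\napple\norange\npear\n"
    println(parseAll(lists, s))
  }
}
```

Because list uses rep1sep and blankLines always consumes at least two line breaks, no step of the repetition can succeed on empty input, which is what keeps the parser from looping or producing empty lists.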

Then again, parser combinators are a bit overkill here. You could get practically the same thing (but with Arrays instead of Lists) with something much simpler:

s.split("\n{2,}").map(_.split("\n"))
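
If Lists are wanted rather than Arrays, the one-liner can be wrapped as in the following minimal sketch. The object name SplitWordLists and the function name wordLists are hypothetical, and the handling of \r\n line endings and of leading or trailing blank lines is an added assumption rather than something the one-liner does.

```scala
object SplitWordLists {

  // Splits text into blocks at runs of blank lines, then each block into lines.
  def wordLists(text: String): List[List[String]] =
    text
      .split("""(\r?\n){2,}""") // two or more consecutive line breaks end a block
      .toList
      .map(_.split("""\r?\n""").toList.filter(_.nonEmpty))
      .filter(_.nonEmpty) // drop empty blocks, e.g. from leading blank lines

  def main(args: Array[String]): Unit = {
    val s = "cat\nmouse\nhorse\n\n\napple\norange\npear\n"
    println(wordLists(s)) // prints List(List(cat, mouse, horse), List(apple, orange, pear))
  }
}
```

Unlike the parser, this version does not check that each line is a single word; it accepts any non-empty line.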
