使用Scalaz Stream解析任务(替换Scalaz Iteratees) [英] Using Scalaz Stream for parsing task (replacing Scalaz Iteratees)

查看:90
本文介绍了使用Scalaz Stream解析任务(替换Scalaz Iteratees)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在许多项目中使用 Scalaz 7 的迭代器,主要用于处理大型文件.我想开始切换到Scalaz streams ,该流旨在替换iteratee软件包(坦率地说,它缺少很多内容)的碎片,使用起来有点痛苦.

I use Scalaz 7's iteratees in a number of projects, primarily for processing large-ish files. I'd like to start switching to Scalaz streams, which are designed to replace the iteratee package (which frankly is missing a lot of pieces and is kind of a pain to use).

流基于 机器 (迭代概念的另一种变体),它们具有在Haskell中也实现了.我曾经使用过Haskell机器库,但是机器和流之间的关系并不十分明显(至少对我而言),并且流库的文档是

Streams are based on machines (another variation on the iteratee idea), which have also been implemented in Haskell. I've used the Haskell machines library a bit, but the relationship between machines and streams isn't completely obvious (to me, at least), and the documentation for the streams library is still a little sparse.

这个问题是关于一个简单的解析任务,我希望看到它是使用流而不是迭代来实现的.如果没有其他人能击败我,我会自己回答这个问题,但是我敢肯定,我不是唯一一个正在(或至少考虑)进行这种过渡的人,并且由于我仍然需要完成此练习,因此我认为我也应该在公开场合这样做.

This question is about a simple parsing task that I'd like to see implemented using streams instead of iteratees. I'll answer the question myself if nobody else beats me to it, but I'm sure I'm not the only one who's making (or at least considering) this transition, and since I need to work through this exercise anyway, I figured I might as well do it in public.

假设我有一个文件,其中包含已被标记并带有词性标记的句子:

Supposed I've got a file containing sentences that have been tokenized and tagged with parts of speech:

no UH
, ,
it PRP
was VBD
n't RB
monday NNP
. .

the DT
equity NN
market NN
was VBD
illiquid JJ
. .

每行只有一个标记,单词和词性由一个空格分隔,空白行代表句子边界.我想解析该文件并返回一个句子列表,我们也可以将其表示为字符串元组列表:

There's one token per line, words and parts of speech are separated by a single space, and blank lines represent sentence boundaries. I want to parse this file and return a list of sentences, which we might as well represent as lists of tuples of strings:

List((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
List((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.)

和往常一样,如果遇到无效的输入或文件读取异常,我们希望正常运行,我们不必担心手动关闭资源等问题.

As usual, we want to fail gracefully if we hit invalid input or file reading exceptions, we don't want to have to worry about closing resources manually, etc.

首先介绍一些一般的文件读取内容(它确实应该是iteratee程序包的一部分,该程序包目前尚无法提供如此高级的功能):

First for some general file reading stuff (that really ought to be part of the iteratee package, which currently doesn't provide anything remotely this high-level):

import java.io.{ BufferedReader, File, FileReader }
import scalaz._, Scalaz._, effect.IO
import iteratee.{ Iteratee => I, _ }

type ErrorOr[A] = EitherT[IO, Throwable, A]

def tryIO[A, B](action: IO[B]) = I.iterateeT[A, ErrorOr, B](
  EitherT(action.catchLeft).map(I.sdone(_, I.emptyInput))
)

def enumBuffered(r: => BufferedReader) = new EnumeratorT[String, ErrorOr] {
  lazy val reader = r
  def apply[A] = (s: StepT[String, ErrorOr, A]) => s.mapCont(k =>
    tryIO(IO(Option(reader.readLine))).flatMap {
      case None       => s.pointI
      case Some(line) => k(I.elInput(line)) >>== apply[A]
    }
  )
}

def enumFile(f: File) = new EnumeratorT[String, ErrorOr] {
  def apply[A] = (s: StepT[String, ErrorOr, A]) => tryIO(
    IO(new BufferedReader(new FileReader(f)))
  ).flatMap(reader => I.iterateeT[String, ErrorOr, A](
    EitherT(
      enumBuffered(reader).apply(s).value.run.ensuring(IO(reader.close()))
    )
  ))
}

然后是我们的句子阅读器:

And then our sentence reader:

def sentence: IterateeT[String, ErrorOr, List[(String, String)]] = {
  import I._

  def loop(acc: List[(String, String)])(s: Input[String]):
    IterateeT[String, ErrorOr, List[(String, String)]] = s(
    el = _.trim.split(" ") match {
      case Array(form, pos) => cont(loop(acc :+ (form, pos)))
      case Array("")        => cont(done(acc, _))
      case pieces           =>
        val throwable: Throwable = new Exception(
          "Invalid line: %s!".format(pieces.mkString(" "))
        )

        val error: ErrorOr[List[(String, String)]] = EitherT.left(
          throwable.point[IO]
        )

        IterateeT.IterateeTMonadTrans[String].liftM(error)
    },
    empty = cont(loop(acc)),
    eof = done(acc, eofInput)
  )
  cont(loop(Nil))
}

最后是我们的解析动作:

And finally our parsing action:

val action =
  I.consume[List[(String, String)], ErrorOr, List] %=
  sentence.sequenceI &=
  enumFile(new File("example.txt"))

我们可以证明它有效:

scala> action.run.run.unsafePerformIO().foreach(_.foreach(println))
List((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
List((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.))

我们完成了.

使用Scalaz流而非迭代程序实现的程序大致相同.

More or less the same program implemented using Scalaz streams instead of iteratees.

推荐答案

scalaz-stream解决方案:

A scalaz-stream solution:

import scalaz.std.vector._
import scalaz.syntax.traverse._
import scalaz.std.string._

val action = linesR("example.txt").map(_.trim).
  splitOn("").flatMap(_.traverseU { s => s.split(" ") match {
    case Array(form, pos) => emit(form -> pos)
    case _ => fail(new Exception(s"Invalid input $s"))
  }})

我们可以证明它有效:

scala> action.collect.attempt.run.foreach(_.foreach(println))
Vector((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
Vector((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.))

我们完成了.

traverseU函数是常见的Scalaz组合器.在这种情况下,它被用来遍历Process monad中由splitOn生成的句子Vector.等效于map,后跟sequence.

The traverseU function is a common Scalaz combinator. In this case it's being used to traverse, in the Process monad, the sentence Vector generated by splitOn. It's equivalent to map followed by sequence.

这篇关于使用Scalaz Stream解析任务(替换Scalaz Iteratees)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆