How to merge adjacent lines with scalaz-stream without losing the splitting line

Problem description

Suppose that my input file myInput.txt looks as follows:

~~~ text1
bla bla
some more text
~~~ text2
lorem ipsum
~~~ othertext
the wikipedia
entry is not
up to date

That is, the file consists of documents separated by lines starting with ~~~. The desired output is as follows:

text1: bla bla some more text
text2: lorem ipsum 
othertext: the wikipedia entry is not up to date

How do I go about that? The following attempt seems pretty unnatural, and it also loses the titles:

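  import scalaz.concurrent.Task
  import scalaz.stream._

  val converter: Task[Unit] =
    io.linesR("myInput.txt")
      // split drops the lines matching the predicate, so the "~~~ title" lines are lost here
      .split(line => line.startsWith("~~~"))
      .intersperse(Vector("\nNew document: "))
      .map(vec => vec.mkString(" "))
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("flawedOutput.txt"))
      .run

  converter.run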
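
The titles go missing because split treats every line that matches the predicate as a pure delimiter and discards it. The sketch below models that behaviour on plain Scala collections. It is not the library's source: SplitModel and splitDropping are hypothetical names used only for illustration.

```scala
object SplitModel {
  // Model of a delimiter-dropping split: an element matching `p`
  // closes the current chunk and is itself discarded.
  def splitDropping[A](xs: Seq[A])(p: A => Boolean): Vector[Vector[A]] = {
    val (done, last) =
      xs.foldLeft((Vector.empty[Vector[A]], Vector.empty[A])) {
        case ((chunks, cur), x) =>
          if (p(x)) (chunks :+ cur, Vector.empty[A]) // delimiter: emit chunk, drop x
          else (chunks, cur :+ x)                    // ordinary line: extend chunk
      }
    done :+ last // emit whatever is left after the final delimiter
  }
}
```

Applied to the first five lines of myInput.txt with the predicate _.startsWith("~~~"), this yields an empty leading chunk followed by the two bodies, and neither "~~~ text1" nor "~~~ text2" survives. Any fix therefore has to move the title out of the delimiter line before splitting, or group the lines without split.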

Recommended answer

The following works fine, but it is extremely slow on anything larger than a toy example (about 5 minutes to process 70 MB). Is that because I am creating Process instances all over the place? It also seems to use only a single core.

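  import scalaz.concurrent.Task
  import scalaz.stream._

  val converter2: Task[Unit] = {
    val docSep = "~~~"
    io.linesR("myInput.txt")
      // Turn each header line "~~~ title" into two elements: the bare separator and the title.
      .flatMap { line =>
        val words = line.split(" ")
        if (words.length == 0 || words(0) != docSep) Process(line)
        else Process(docSep, words.tail.mkString(" "))
      }
      // The bare separator is dropped; the title becomes the head of each chunk.
      .split(_ == docSep)
      // Drop the empty chunk produced before the first separator.
      .filter(_.nonEmpty)
      .map(lines => lines.head + ": " + lines.tail.mkString(" "))
      .intersperse("\n")
      .pipe(text.utf8Encode)
      .to(io.fileChunkW("correctButSlowOutput.txt"))
      .run
  }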
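
Whether the per-line Process creation is what makes this slow is not settled here. One way to find out would be to time the same transformation without scalaz-stream and compare. The following is a minimal baseline sketch using only the Scala standard library, not the scalaz-stream API. MergeDocsStreaming, merged and run are hypothetical names, and the UTF-8 encoding is an assumption.

```scala
import java.io.PrintWriter
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

object MergeDocsStreaming {
  val docSep = "~~~"

  // Lazily turns an iterator of lines into an iterator of "title: body" strings.
  // Only one document is held in memory at a time.
  def merged(lines: Iterator[String]): Iterator[String] = new Iterator[String] {
    private val in = lines.buffered
    // Skip anything that appears before the first header line.
    while (in.hasNext && !in.head.startsWith(docSep)) in.next()

    def hasNext: Boolean = in.hasNext

    def next(): String = {
      // The current line is a header: keep its text as the title.
      val title = in.next().stripPrefix(docSep).trim
      val body = ArrayBuffer.empty[String]
      // Collect body lines until the next header or the end of input.
      while (in.hasNext && !in.head.startsWith(docSep)) body += in.next()
      title + ": " + body.mkString(" ")
    }
  }

  // Reads inPath, writes one merged document per line to outPath.
  def run(inPath: String, outPath: String): Unit = {
    val src = Source.fromFile(inPath, "UTF-8")
    val out = new PrintWriter(outPath, "UTF-8")
    try merged(src.getLines()).foreach(line => out.println(line))
    finally { out.close(); src.close() }
  }
}
```

Calling MergeDocsStreaming.run("myInput.txt", "baselineOutput.txt") on the 70 MB input and comparing the wall-clock time with converter2 would indicate how much of the cost comes from the stream machinery rather than from file I/O. Like converter2, this baseline is sequential, so it says nothing about the single-core observation.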
