文件中重复单词的迭代器 [英] Iterator of repeated words in a file

查看:91
本文介绍了文件中重复单词的迭代器的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设,我正在编写一个函数来在文本文件中查找重复的单词".例如,在aaa aaa bb cc cc bb dd中重复的单词是aaacc,但不是bb,因为两个bb实例不会彼此相邻出现.

Suppose, I'm writing a function to find "repeated words" in a text file. For example, in aaa aaa bb cc cc bb dd repeated words are aaa and cc but not bb, because two bb instances don't appear next to each other.

该函数接收一个迭代器,并返回如下迭代器:

The function receives an iterator and returns iterator like that:

def foo(in: Iterator[String]): Iterator[String] = ???

foo(Iterator("aaa", "aaa", "bb", "cc", "cc", "bb")) // Iterator("aaa", "cc")
foo(Iterator("a", "a", "a", "b", "c", "b"))         // Iterator("a")

您将如何编写foo?请注意,输入量很大,所有单词都不适合存储(但是重复单词的数量相对较小).

How would you write foo ? Note that the input is huge and all words do not fit in memory (but the number of repeated words is relatively small).

P.S.我还要在以后增强foo以返回重复单词的位置,重复次数等.

P.S. I would like also to enhance foo later to return also positions of the repeated words, the number of repetitions, etc.

推荐答案

更新:

然后确定.让我们指定您想要的内容:

OK then. Let specify bit what you want:

 input       | expected    
             |             
 a           |             
 aa          | a           
 abc         |             
 aabc        | a           
 aaabbbbbbc  | ab          
 aabaa       | aa          
 aabbaa      | aba         
 aabaa       | aa    

是真的吗?如果是这样,这是可行的解决方案.不确定性能,但至少它是惰性的(不要将所有内容加载到内存中).

Is it true? If so this is working solution. Not sure about performance but at least it is lazy (don't load everything into memory).


//assume we have no nulls in iterator.
def foo[T >: Null](it:Iterator[T]) = {
  (Iterator(null) ++ it).sliding(3,1).collect {
    case x @ Seq(a,b,c) if b == c && a != b => c
  }
}

我们需要这个难看的Iterator(null) ++,因为我们正在寻找3个元素,并且需要一种方法来查看前两个元素是否相同.

We need this ugly Iterator(null) ++ because we are looking for 3 elements and we need a way to see if first two are the same.

这是纯实现,与命令式命令相比有一些优势(例如,在其他答案中).最重要的是它很懒:

This is pure implementation and it has some advantages over imperative one (eg. in other answers). Most important one is that it is lazy:

//infinite iterator!!!
val it = Iterator.iterate('a')(s => (s + (if(Random.nextBoolean) 1 else 0)).toChar)
//it'll take only as much as needs to take this 10 items.
//should not blow up
foo(it).take(10)
//imperative implementation will blow up in such situation.
fooImp(it).take(10)

这是此主题中的所有实现以及本主题中其他帖子的所有实现: https://scalafiddle.io/sf/w5yozTA/15

here are all implementations from this and other posts seen in this topic: https://scalafiddle.io/sf/w5yozTA/15

索引和位置

在评论中,您询问添加重复单词的数量及其索引是否容易.我想了一会儿,我做了这样的事情.不知道它是否具有出色的性能,但是应该是懒惰的(例如,应该适用于大文件).

In comment you have asked if it would be easy to add the number of repeated words and their indices. I thought about it for a while and i've made something like this. Not sure if it has great performance but it should be lazy (eg. should work for big files).

/** returns Iterator that replace consecutive items with (item, index, count). 
It contains all items from orginal iterator.  */
def pack[T >: Null](it:Iterator[T]) = {
  //Two nulls, each for one sliding(...) 
  (Iterator(null:T) ++ it ++ Iterator(null:T))
  .sliding(2,1).zipWithIndex
  //skip same items
  .filter { case (x, _) => x(0) != x(1) }
  //calculate how many items was skipped
  .sliding(2,1).collect {
    case Seq((a, idx1), (b, idx2)) => (a(1), idx1 ,idx2-idx1)  
  }
}

def foo[T >: Null](it:Iterator[T]) = pack(it).filter(_._3 > 1)

旧答案(在更新问题之前)

另一个(更简单的)解决方案可能是这样的:

Another (simpler) solution could be something like this:

import scala.collection.immutable._

//Create new iterator each time we'll print it.
def it = Iterator("aaa", "aaa", "bb", "cc", "cc", "bb", "dd", "dd", "ee",  "ee", "ee", "ee", "ee", "aaa", "aaa", "ff", "ff", "zz", "gg", "aaa", "aaa")

//yep... this is whole implementation :)
def foo(it:Iterator[String]) = it.sliding(2,1).collect { case Seq(a,b) if a == b => a } 


println(foo(it).toList) //dont care about duplication
//List(aaa, cc, dd, ee, ee, ee, ff)

println(foo(it).toSet) //throw away duplicats but don't keeps order
//Set(cc, aaa, ee, ff, dd)

println(foo(it).to[ListSet]) //throw away duplicats and keeps order
//ListSet(aaa, cc, dd, ee, ff)

//oh... and keep result longer than 5 items while testing. 
//Scala collections (eg: Sets) behaves bit diffrently up to this limit (they keeps order)
//just test with bit bigger Sequences :)

https://scalafiddle.io/sf/w5yozTA/1

(如果答案有帮助,请投票)

(if answer is helpful up-vote please)

这篇关于文件中重复单词的迭代器的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆