列表中的前n个项目(包括重复项) [英] Top n items in a List ( including duplicates )

查看:101
本文介绍了列表中的前n个项目(包括重复项)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

试图找到一种有效的方法来获取非常大的列表中可能包含重复项的前N个项目.

Trying to find an efficient way to obtain the top N items in a very large list, possibly containing duplicates.

我首先尝试了排序&切片,有效.但这似乎是不必要的.如果只需要前20名成员,则无需对非常大的列表进行排序. 因此,我编写了一个构建前n个列表的递归例程.这也可以,但是比非递归的要慢得多!

I first tried sorting & slicing, which works. But this seems unnnecessary. You shouldn't need to sort a very large list if you just want the top 20 members. So I wrote a recursive routine which builds the top-n list. This also works, but is very much slower than the non-recursive one!

问题:我的第二个例程(elite2)比Elite慢得多吗?如何使它更快?我的代码附在下面.谢谢.

Question: Which is my second routine (elite2) so much slower than elite, and how do I make it faster ? My code is attached below. Thanks.

import scala.collection.SeqView
import scala.math.min
object X {

    def  elite(s: SeqView[Int, List[Int]], k:Int):List[Int] = {
        s.sorted.reverse.force.slice(0,min(k,s.size))
    }

    def elite2(s: SeqView[Int, List[Int]], k:Int, s2:List[Int]=Nil):List[Int] = {
        if( k == 0 || s.size == 0) s2.reverse
        else {
            val m = s.max
            val parts = s.force.partition(_==m)
            val whole = if( parts._1.size > 1) parts._1.tail:::parts._2 else parts._2
            elite2( whole.view, k-1, m::s2 )
        }
    }

    def main(args:Array[String]) = {
        val N = 1000000/3
        val x = List(N to 1 by -1).flatten.map(x=>List(x,x,x)).flatten.view
        println(elite2(x,20))
        println(elite(x,20))
    }
}

推荐答案

除非我遗漏了一些东西,为什么不遍历列表并随手挑选前20名呢?只要您跟踪前20个元素中的最小元素,就不会有开销,除非添加到前20个元素(对于长列表而言,这种情况相对很少).这是一个实现:

Unless I'm missing something, why not just traverse the list and pick the top 20 as you go? So long as you keep track of the smallest element of the top 20 there should be no overhead except when adding to the top 20, which should be relatively rare for a long list. Here's an implementation:

  def topNs(xs: TraversableOnce[Int], n: Int) = {
    var ss = List[Int]()
    var min = Int.MaxValue
    var len = 0
    xs foreach { e =>
      if (len < n || e > min) {
        ss = (e :: ss).sorted
        min = ss.head
        len += 1
      }
      if (len > n) {
        ss = ss.tail
        min = ss.head
        len -= 1
      }                    
    }
    ss
  }  

(已编辑,因为我最初使用的是SortedSet,但没有意识到您想保留重复项.)

(edited because I originally used a SortedSet not realising you wanted to keep duplicates.)

我以此为基准列出了100k个随机整数,平均花费40毫秒.您的elite方法大约需要850毫秒,而您的elite2方法大约需要4100毫秒.因此,这比您的最快速度快20倍以上.

I benchmarked this for a list of 100k random Ints, and it took on average 40 ms. Your elite method takes about 850 ms and and your elite2 method takes about 4100 ms. So this is over 20 x quicker than your fastest.

这篇关于列表中的前n个项目(包括重复项)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆