从数字流存储最​​多5000号 [英] Store the largest 5000 numbers from a stream of numbers

查看:127
本文介绍了从数字流存储最​​多5000号的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑以下问题:

  

从数字流存储最​​多5000号

其中弹簧想到的解决方案是一个二分搜索树保持树中节点的数目和一个参照最小节点一旦计数的计数到达5000当计数值达到5000时,每一个新的号码添加罐是相比在树中最小项。如果大于,则新的号码可加入随后的最小移除,并且新的最小计算(这应该是很简单的已经具有previous最小)。

我的这种解决方案值得关注的是二叉树自然会得到扭曲的(因为我只能删除一侧)。

有没有办法来解决这个问题,不会产生可怕的扭曲的树?

在万一有人需要它,我已经包括伪$ C $下我的解决方案,到目前为止如下:

 程序(数量)
{
  如果(计数== 5000和放大器;&功放;数字> smallest.Value)
  {
    ADDNODE(根号)
    最小= deleteNodeAndGetNewSmallest(根,最小)
  }
}

deleteNodeAndGetNewSmallest(lastSmallest)
{
  如果(lastSmallest有父)
  {
    如果(lastSmallest有右子)
    {
      最小= getMin(lastSmallest.right)
      lastSmallest.parent.right = lastSmallest.right
    }
    其他
    {
      最小= lastSmallest.parent
    }
  }
  其他
  {
    最小= getMin(lastSmallest.right)
    根= lastSmallest.right
  }
  计数 - 
  返回最小
}

getMin(节点)
{
  如果(节点已经离开)
    返回getMin(node.left)
  其他
    返回节点
}

添加(号码)
{
  //标准的实施增加了BST
  算上++
}
 

解决方案

这个最简单的办法是保持一个最小的堆最大尺寸5000

  • 在一个新的数字到达每一次 - 检查堆越小则 5000,如果是 - 添加
  • 如果它不是 - 检查是否最小越小则新 元,如果是,弹出出来并插入新的元素代替。
  • 当你完成 - 你有一个包含5000最大的元素堆

这个解决方案是 O(nlogk)的复杂性,其中 N 是元素的数量和<$ C $ ç> K 的元素,你需要(在你的情况下,5000)的数量。

它也可以在完成为O(n)使用的选择算法 - 存储所有元素,然后找到5001th最大元素,并返回一切都大于它。但是这是很难实现和合理的规模投入 - 或许不是更好。另外,如果流包含重复,需要更多的处理。

Given the following problem:

"Store the largest 5000 numbers from a stream of numbers"

The solution which springs to mind is a binary search tree maintaining a count of the number of nodes in the tree and a reference to the smallest node once the count reaches 5000. When the count reaches 5000, each new number to add can be compared to the smallest item in the tree. If greater, the new number can be added then the smallest removed and the new smallest calculated (which should be very simple already having the previous smallest).

My concern with this solution is that the binary tree is naturally going to get skewed (as I'm only deleting on one side).

Is there a way to solve this problem which won't create a terribly skewed tree?

In case anyone wants it, I've included pseudo-code for my solution so far below:

process(number)
{
  if (count == 5000 && number > smallest.Value)
  {
    addNode( root, number)
    smallest = deleteNodeAndGetNewSmallest ( root, smallest)
  }
}

deleteNodeAndGetNewSmallest( lastSmallest)
{
  if ( lastSmallest has parent)
  {
    if ( lastSmallest has right child)
    {
      smallest = getMin(lastSmallest.right)
      lastSmallest.parent.right = lastSmallest.right
    }
    else
    {
      smallest = lastSmallest.parent
    }
  }
  else 
  {
    smallest = getMin(lastSmallest.right)
    root = lastSmallest.right
  }
  count--
  return smallest
}

getMin( node)
{
  if (node has left)
    return getMin(node.left)
  else
    return node
}

add(number)
{
  //standard implementation of add for BST
  count++
}

解决方案

The simplest solution for this is maintaining a min heap of max size 5000.

  • Every time a new number arrives - check if the heap is smaller then 5000, if it is - add it.
  • If it is not - check if the minimum is smaller then the new element, and if it is, pop it out and insert the new element instead.
  • When you are done - you have a heap containing 5000 largest elements.

This solution is O(nlogk) complexity, where n is the number of elements and k is the number of elements you need (5000 in your case).

It can be done also in O(n) using selection algorithm - store all the elements, and then find the 5001th largest element, and return everything bigger than it. But it is harder to implement and for reasonable size input - might not be better. Also, if stream contains duplicates, more processing is needed.

这篇关于从数字流存储最​​多5000号的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆