使用 STL 仅保留 N 个最小元素(有重复项) [英] Keeping only N smallest elements with STL (with duplicates)

查看:24
本文介绍了使用 STL 仅保留 N 个最小元素(有重复项)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

下面的 MVCE 尝试仅从随机元素(包含重复项)的大型传入输入流中按升序输出 5 个最小元素.

The MVCE below attempts to output only the 5 smallest elements in ascending order from a large incoming input stream of random elements (which contains duplicates).

int main(int argc, char *argv[])
{
    std::set<int> s;   //EDIT:  std::multiset is an answer to Q1

    for (int i : {6, 6, 5, 8, 3, 4, 0, 2, 8, 9, 7, 2})  //Billions of elements in reality
    {
        if ( (s.size() < 5) || (i <= *(--s.end())) )  //Insert only if not full or when the element to be inserted is smaller than the greatest one already in the set
        {
            if (s.size() >= 5)  //Limit the number of smallest elements that are kept. In reality ~8000
                s.erase(*(--s.end())); //Erase the largest element

            s.insert(i);
        }
    }

    for (int d: s)
        std::cout << d << " ";  //print the 5 smallest elements in ascending order      

    std::cout << '\n';

    return 0;
}

输出为:

0 2 3 4

输出应该是:

0 2 2 3 4

问题 1:必须更改哪些内容才能允许重复?
问题 2:如何在不浪费 GB 内存来存储所有输入元素的情况下使此代码更快?(代码现在太慢了,因为它是).

Q1: What must be changed to allow duplicates ?
Q2: How can this code be made faster while not wasting GBs of memory for storing all of the input elements? (the code is way too slow now, as it is).

推荐答案

这听起来像是经典的面试问题如何在不知道将要处理的数据的大小的情况下存储最小的 N 个项目?".

This sounds like the classical interview question "how to store the smallest N items, without knowledge of the size of the data that will be processed?".

>

一个答案是使用N个项目的最大堆,如果后续项目小于或等于最顶部项目,然后调整堆(移除顶部元素,添加新元素,堆化)堆.

One answer is to use a max-heap of N items, and then adjust the heap (remove the top element, add the new element, heapify) if the subsequent item is less than or equal to the top most item in the heap.

这可以使用 C++ 库函数轻松完成std::make_heapstd::pop_heapstd::push_heap.

This can be easily done using the C++ library functions std::make_heap, std::pop_heap, and std::push_heap.

这是一个例子:

#include <vector>
#include <algorithm>
#include <iostream>

int main(int argc, char *argv[])
{
    std::vector<int> s; 
    for (int i : {6, 6, 5, 8, 3, 4, 0, 2, 8, 9, 7, 2})
    {
        // add the first 5 elements to the vector
        if (s.size() < 5)
        {
            s.push_back(i);
            if ( s.size() == 5 )
                // make the max-heap of the 5 elements   
                std::make_heap(s.begin(), s.end());
            continue;
        }

        // now check if the next element is smaller than the top of the heap
        if (s.front() >= i)
        {
            // remove the front of the heap by placing it at the end of the vector
            std::pop_heap(s.begin(), s.end());

            // get rid of that item now 
            s.pop_back();

            // add the new item 
            s.push_back(i);

            // heapify
            std::push_heap(s.begin(), s.end());
        }
    }

    // sort the heap    
    std::sort_heap(s.begin(), s.end());

    for (int d : s)
        std::cout << d << " ";  //print the 5 smallest elements in ascending order      

    std::cout << '\n';

    return 0;
}

输出:

0 2 2 3 4

当然你可以把它变成一个函数,用N替换硬编码的5.

Of course you can make this a function and replace the hard-coded 5 with N.

如果有数十亿个元素,即比 N 多得多的元素,那么堆中唯一保留的就是 N 个元素.

If there are billions of elements, i.e. many more elements than N, the only thing that will be kept in the heap are N elements.

只有在检测到新项满足最小 N 元素之一时才会操作最大堆,这很容易通过检查堆中的顶部项并将其与正在处理的新项进行比较来完成处理.

The max-heap is only manipulated if it is detected that the new item satisfies being one of the smallest N elements, and that is easily done by inspecting the top item in the heap and comparing it with the new item that is being processed.

这篇关于使用 STL 仅保留 N 个最小元素(有重复项)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆