What are the fast algorithms to find duplicate elements in a collection and group them?


Question

Say you have a collection of elements. How can you pick out those with duplicates and put them into groups with the least number of comparisons? Preferably in C++, but the algorithm matters more than the language. For example, given {E1,E2,E3,E4,E4,E2,E6,E4,E3}, I wish to extract {E2,E2}, {E3,E3}, {E4,E4,E4}. What data structure and algorithm would you choose? Please also include the cost of setting up the data structure, say, if it's a pre-sorted one like std::multimap.

To make things clearer, as suggested, there is one constraint: the elements must be compared directly to be certain they are duplicates.

So hashes do not apply, because they merely shift the comparison from heavy elements (e.g. chunks of data) to light ones (integers). That reduces some comparisons but does not do away with them, and in the end we are back at the original problem once we are inside one collision bucket.

Pretend you have a bunch of potentially duplicate files of GBs each, and they bear the same hash value under every hash algorithm known to humanity. Now you are going to spot the real duplicates.

No, it can't be a real-life problem (even MD5 is enough to generate a unique hash for real-life files). But just pretend, so that we can focus on finding the data structure + algorithm that involves the least number of comparisons.

What I'm doing is:

  1. Represent the elements in a STL std::list data structure (because 1) its element deletion is cheaper than, say, a vector's, and 2) its insertion is cheap, requiring no sort).

Pop one element out and compare it with the rest; whenever a duplicate is found, pull it out of the list. Once the end of the list is reached, one group of duplicates has been found, if any.

Repeat the above two steps until the list is empty.

It needs N-1 comparisons in the best case, but up to N(N-1)/2 in the worst case, since the scan compares each pair of elements at most once.
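As a sanity check on those counts, here is a small sketch (not the author's code; the element type is narrowed to int for illustration) that counts the equality comparisons the shrink-the-list scan performs:

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Counts equality comparisons performed by the naive shrink-the-list scheme
// described above, on a given input sequence.
std::size_t naive_comparisons(std::list<int> l)
{
    std::size_t count = 0;
    while (l.size() > 1)
    {
        int front = l.front();
        l.pop_front();
        for (std::list<int>::iterator it = l.begin(); it != l.end(); )
        {
            ++count;
            if (*it == front) it = l.erase(it); // duplicate: shrink the list
            else ++it;
        }
    }
    return count;
}
```

With 5 all-distinct elements it performs 4+3+2+1 = 10 = N(N-1)/2 comparisons; with 5 all-equal elements the first pass empties the list after N-1 = 4 comparisons.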

Any better options?

My code, using the method explained above:

// algorithm to consume the std::list container,
// supports: list<path_type>, list< pair<std::string, paths_type::const_iterator> >
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrunken list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            typename list<T>::iterator it     = l.begin(); 
            typename list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool  result;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 




Thanks for the answers below; however, they seem to have been misled by my example into thinking this is about integers. In fact the elements are type-agnostic (not necessarily integers, strings or any other PODs), and the equality predicates are self-defined, that is, the comparison can be very heavy.

So maybe my question should be: which data structure + algorithm involves fewer comparisons?

Using a pre-sorted container like multiset or multimap is not better, according to my tests, since

  1. the sorting while inserting already does the comparisons,
  2. the subsequent adjacent-find does the comparisons again,
  3. these data structures prefer less-than operations to equality operations: they perform two less-than comparisons (a < b, then b < a) to establish equivalence.
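Point 3 can be seen directly: ordered containers decide that two keys are equivalent via !(a < b) && !(b < a). A tiny counting type (hypothetical, for illustration only) makes that cost visible:

```cpp
#include <cassert>

// Hypothetical element type that counts calls to operator<, to show why
// ordered containers spend two '<' calls to establish that two keys are equal.
struct Counted
{
    int v;
    static int less_calls;
    bool operator<(const Counted& o) const { ++less_calls; return v < o.v; }
};
int Counted::less_calls = 0;

// The equivalence test used by std::map/std::set with the default comparator.
bool equivalent(const Counted& a, const Counted& b)
{
    return !(a < b) && !(b < a);
}
```

For equal keys both `<` calls are made; for unequal keys the `&&` short-circuits after one.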

I do not see how it can save comparisons.

One more thing that's ignored by some answers below: I need to differentiate the duplicate groups from one another, not just keep them in the container.

After all the discussion, there seem to be 3 ways:

  1. my original naive method as explained above
  2. start with a linear container like std::vector, sort it, and then locate the equal ranges
  3. start with an associative container like std::map<Type, vector<Type>>, and pick out the duplicates during the setup of the associative container, as suggested by Charles Bailey.
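For reference, method 2 can be sketched in a few lines (a minimal sketch, using int as a stand-in for the heavy element type; names are illustrative, not from the test code below):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Method 2 sketch: sort a vector, then sweep it once, collecting each
// maximal run of equal elements that has more than one member as a group.
std::vector<std::vector<int> > group_duplicates(std::vector<int> v)
{
    std::sort(v.begin(), v.end());
    std::vector<std::vector<int> > groups;
    for (std::size_t i = 0; i < v.size(); )
    {
        std::size_t j = i + 1;
        while (j < v.size() && v[j] == v[i]) ++j;      // extend the equal run
        if (j - i > 1)                                 // keep only real duplicate groups
            groups.push_back(std::vector<int>(v.begin() + i, v.begin() + j));
        i = j;
    }
    return groups;
}
```

On the question's example {1,2,3,4,4,2,6,4,3} this yields the groups {2,2}, {3,3}, {4,4,4}.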

I've coded a sample to test all the methods; it is posted below.

The number of duplicates and how they are distributed may influence the best choice:

  • Method 1 is best when duplicates are clustered heavily at the front, and worst when they fall at the end; sorting does not change the distribution, only the ordering.
  • Method 3 has the most even performance.
  • Method 2 is never the best choice.

One output run with 20 sample items from the code below.

Test with [ 20 10 6 5 4 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 ]

and [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 4 5 6 10 20 ] respectively

using std::vector -> sort() -> adjacent_find():

comparisons: [ '<' = 139, '==' = 23 ]

comparisons: [ '<' = 38, '==' = 23 ]

using std::list -> sort() -> shrink list:

comparisons: [ '<' = 50, '==' = 43 ]

comparisons: [ '<' = 52, '==' = 43 ]

using std::list -> shrink list:

comparisons: [ '<' = 0, '==' = 121 ]

comparisons: [ '<' = 0, '==' = 43 ]

using std::vector -> std::map<Type, vector<Type>>:

comparisons: [ '<' = 79, '==' = 0 ]

comparisons: [ '<' = 53, '==' = 0 ]

using std::vector -> std::multiset -> adjacent_find():

comparisons: [ '<' = 79, '==' = 7 ]

comparisons: [ '<' = 53, '==' = 7 ]

// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " << operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method %2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const typename Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
                itr_begin, 
                itr_end, 
                [](const VALUE& v1, const VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find the end of one equal range
            VALUE start = *itr; 
            itr_range_begin = itr; // remember where this group starts
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort(); // std::list cannot use std::sort(); it provides its own member sort()
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        // drop entries that hold only one element: they are not duplicate groups
        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.size() < 2)
                local_map.erase(itr++); // post-increment keeps the iterator valid across erase
            else
                ++itr;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef multiset<Type> SET; // a plain set would silently drop the duplicates we are looking for
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
                itr_begin, 
                itr_end, 
                [](const VALUE& v1, const VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find the end of one equal range
            VALUE start = *itr; 
            itr_range_begin = itr; // remember where this group starts
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    const size_t N = 20;

    Vector().run(N);
    List(true).run(N);
    List(false).run(N);
    Map().run(N);
    Multiset().run(N);
}

Recommended answer

You could use a map from a representative element to a list/vector/deque of the other elements. This requires relatively few comparisons on insertion into the container, and it means that you can iterate through the resulting groups without having to perform any comparisons.

This example always inserts the first representative element into the mapped deque storage, as it makes the subsequent iteration through the group logically simple; but if this duplication proves a problem, then it would be easy to perform the push_back only if (!ins_pair.second).

typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}

Iterating through the groups is then (relatively) simple and cheap:

void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}
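To make the snippet self-contained, a minimal driver might look like this (a sketch, with Type narrowed to int for illustration; CountDuplicateGroups is an added helper, not part of the original answer):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>

// Self-contained driver for the map-of-deques approach above,
// using int as a stand-in for the (possibly heavyweight) Type.
typedef int Type;
typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    // insert() either creates a fresh empty deque or finds the existing one;
    // either way the element is appended to its group.
    std::pair<Storage::iterator, bool> ins_pair(
        s.insert(std::make_pair(t, std::deque<Type>())));
    ins_pair.first->second.push_back(t);
}

std::size_t CountDuplicateGroups(const Storage& s)
{
    std::size_t groups = 0;
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
        if (i->second.size() > 1) ++groups; // only groups with real duplicates
    return groups;
}
```

Feeding the question's example {1,2,3,4,4,2,6,4,3} through Insert yields three duplicate groups: {2,2}, {3,3}, {4,4,4}.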

I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. an average of 2 objects per group), the above method cost the following number of comparisons and copies:

1630674 comparisons, 443290 copies

(I tried bringing the number of copies down, but only really managed to do so at the expense of comparisons, which seem to be the higher-cost operation in your scenario.)

Using a multimap, and retaining the previous element during the final iteration to detect the group transitions, cost this:

1756208 comparisons, 100000 copies

Using a single list, popping the front element and performing a linear search for other group members, cost:

1885879088 comparisons, 100000 copies

Yes, that's ~1.9bn comparisons compared with ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons, it would have to be sorted, and that is going to cost a similar number of comparisons to building an inherently ordered container in the first place.

Edit

I took your posted code and ran the implied algorithm (I had to make some assumptions about the code, as there are some assumed definitions) over the same test data set as I used before, and I counted:

1885879088 comparisons, 420939 copies

i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that means we are using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of only the groups that contain more than one equivalent element. This can be achieved in my Iterate function simply by adding an if (i->second.size() > 1) clause.

I still can't see any evidence that building a sorted container, such as this map of deques, isn't a good (even if not optimal) strategy.
