什么是快速算法在集合中查找重复元素并将其分组？ [英] what are the fast algorithms to find duplicate elements in a collection and group them?

查看：234 发布时间：2017/7/20 23:15:20 c++ algorithm elements duplicate-data

本文介绍了什么是快速算法在集合中查找重复元素并将其分组？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

说你有一个元素的集合，你如何挑选出重复的元素，并把它们放在每个组中，最少的比较？最好在C ++中，但算法比语言更重要。
对于给定{E1，E2，E3，E4，E4，E2，E6，E4，E3}的示例
，我想提取{E2，E2}，{E3，E3}，{E4 ，E4，E4}。
你会选择什么数据结构和算法？另请参阅设置数据结构的费用，比如，如果是一个预先排序的，如std :: multimap

更新

根据建议使事情更清晰。有一个约束：元素必须自己比较，以确定它们是重复的。所以哈希不适用，因为实际上它们将比较从重元素（例如数据块）转移到轻元素（整数），并减少了一些比较，但是不要离开他们，最后我们回到我们原来的问题，什么时候在一个碰撞桶里。假设你有一堆潜在的GB的重复文件，他们拥有相同的哈希值，每个哈希算法人都知道。现在你要找到真正的重复。

不，它不能是一个现实生活中的问题（甚至MD5就足以为现实生成独特的哈希文件）。但只是假装，以便我们可以专注于找到涉及最少比较的数据结构+算法。

我正在做的是

表示为STL std :: list数据结构（在那1 ）它的元素删除比例如向量更便宜2）它的插入便宜，不需要排序。）

弹出一个元素并比较它其余的，如果发现重复，则将其从列表中拉出。一旦达到列表的结尾，就会发现一组重复（如果有的话）。

重复上述2个步骤，直到列表为空。 p>

在最好的情况下需要N-1，但是（N-1）！

$ b

p>我的代码使用上述方法：

  //使用std :: list容器的算法，
 / / supports：list< path_type>，list< pair< std :: string，paths_type :: const_iterater>> 
 template< class T> 
 struct consume_list 
 {
 groups_type operator（）（list< T>& l）
 {
 //删除虚假相同并分组其余
 // algorithm：
 // 1.将第一个元素与其余元素进行比较，
 //选出所有重复的文件，包括第一个元素本身。 
 // 2.再次使用缩小的列表
 //重新开始，直到列表包含一个或零个元素。 
 
 groups_type sub_groups; 
 group_type one_group; 
 one_group.reserve（1024）; 
 
 while（l.size（）> 1）
 {
 T front（l.front（））; 
 l.pop_front（）; 
 
 item_predicate< T> EP（前）; 
 list< T> :: iterator it = l.begin（）; 
 list< T> :: iterator it_end = l.end（）; 
 while（it！= it_end）
 {
 if（ep.equals（* it））
 {
 one_group.push_back（ep.extract_path（* ）））; //单出来
 it = l.erase（it）; 
} 
 else 
 {
 it ++; 
} 
} 
 
 //保存结果
如果（！one_group.empty（））
 {
 // save 
 one_group.push_back（ep.extract_path（front））; 
 sub_groups.push_back（one_group）; 
 
 //清除，内存分配未释放
 one_group.clear（）; 
} 
} 
 return sub_groups; 
} 
}; 
 
 
 //在stl容器内的项目项目比较类型，例如std :: list 
 template< class T> 
 struct item_predicate {}; 
 
 //类型为path_type的特殊化
 template<> 
 struct item_predicate< path_type> 
 {
 public：
 item_predicate（const path_type& base）/ * init list * / 
 {} 
 public：
 bool equals（const path_type& comparee）
 {
 bool result; 
 / *这里耗时的操作* / 
返回结果; 
} 
 
 const path_type& extract_path（const path_type& p）
 {
 return p; 
} 
 private：
 //类成员
}; 
 
 
};

感谢下面的答案，被我的例子误导为整数。实际上，这些元素是不可知的类型（不一定是整数，字符串或任何其他POD），而相等的谓词是自定义的，那就是比较可能很重 。

所以也许我的问题应该是：使用哪种数据结构+算法涉及较少的比较。

根据我的测试，使用预分类容器，如multiset，multimap不是更好，因为

做比较，

以下相邻发现再次进行比较，

这些数据结构优于操作至等于操作，它们执行2次-than（a

我没有看到如何保存比较。

还有一件事情被下面的一些答案所忽略，我需要将重复的组彼此区分开，而不仅仅是将它们保留在容器中。

结论

所有讨论结束后，似乎有3种方式

我原来的天真的方法，如上所述

从线性con像 std :: vector 一样，排序，然后找到相等范围

从相关的容器开始，如 std :: map< Type，vector< duplicateates>> ，在由Charles Bailey建议的关联容器设置期间选出重复项。

我已经编写了一个示例来测试下面发布的所有方法。

重复次数及分发次数可能会影响最佳选择。

方法1最好当他们在前面严重倒塌时，最糟糕的是最后。

方法3具有最高的性能

方法2永远不是最佳选择

感谢所有参与讨论的人。

一个输出与20个样品

用[20 10 6 5 4 3 2 2 2 2 1
1 1 1 1 1 1 1 1 1]

和[1 1 1 1 1 1 1 1 1 1 2 2 2 2 3
4 5 6 10 20] p>

使用std :: vector - > sort（） - >
adjacent_find（）：

比较： ['<'= 139，'=='= 23
]

比较：['<'= 38，'=='= 23 ]

使用std :: list - > sort（） - >收缩
列表：

比较：['<'= 50，'=='= 43]

比较：['<'= 52，'=='= 43] / p>

使用std :: list - > shrink列表：

比较：['<'= 0，'=='= 121]

比较：['<'= 0，'=='= 43]

使用std :: vector - > std :: map>：

比较：['<'= 79，'=='= 0]

比较：['<'= 53 ，'=='= 0]

使用std :: vector - >
std :: multiset - >
adjacent_find（） p>

比较：['<'= 79，'=='= 7]

比较：['< ;'= 53，'=='= 7]

代码

< p $ p> //使用VC ++编译10：cl.exe / EHsc #include< vector> #include< deque> #include< list> #include< map> #include< set> #include< algorithm> #include< iostream> #include< sstream> #include< boost / foreach.hpp> #include< boost / tuple / tuple.hpp> #include< boost / format.hpp> 使用namespace std; struct类型 { 类型（int i）：m_i（i）{} bool operator<（const Type& t）const { ++ number_less_than_comparison; return m_i< t.m_i; } bool operator ==（const Type& t）const { ++ number_equal_comparison; return m_i == t.m_i; } public： static void log（const string& operation） { cout < <操作<<< ：[ < '<'=<< number_less_than_comparison << ， < '=='=<< number_equal_comparison << ] \\\ ; reset（）; } int to_int（）const { return m_i; } private： static void reset（） { number_less_than_comparison = 0; number_equal_comparison = 0; } public： static size_t number_less_than_comparison; static size_t number_equal_comparison; private： int m_i; }; size_t类型:: number_less_than_comparison = 0; size_t Type :: number_equal_comparison = 0; ostream&运算符<（ostream& os，const Type& t） { os< t.to_int（）; return os; } 模板<集装箱> struct Test { void recursive_run（size_t n） { bool reserve_order = false; for（size_t i = 48; i< n; ++ i） { run（i）; } } void run（size_t i） { cout<< boost :: format（\\\ \\\ Test％1％sample elements\\\ using method％2％：\\\ ）％i ％说明（）; generate_sample（i）; sort（）; locate（）; generate_reverse_sample（i）; sort（）; locate（）; } private： void print_me（const string& when） { std :: stringstream ss; ss<<当<= [; BOOST_FOREACH（const Container :: value_type& v，m_container） { ss< v< ; } ss<< ] \\\ ; cout<<< ss.str（）; } void generate_sample（size_t n） { m_container.clear（）; for（size_t i = 1; i <= n; ++ i） { m_container.push_back（Type（n / i））; } print_me（init value）; Type :: log（setup）; } void generate_reverse_sample（size_t n） { m_container.clear（）; for（size_t i = 0; i { m_container.push_back（Type（n /（n-i）））; } print_me（init value（reverse order））; Type :: log（setup）; } void sort（） { sort_it（）; 类型:: log（sort）; print_me（after sort）; } void locate（） { locate_duplicates（）; 类型:: log（locate duplicate）; } protected：虚拟字符串描述（）= 0; virtual void sort_it（）= 0; virtual void locate_duplicates（）= 0; protected：容器m_container; }; struct Vector：Test< vector< Type> > { string Description（） { returnstd :: vector< Type> - > sort（） - > adjacent_find（）; } private： void sort_it（） { std :: sort（m_container.begin（），m_container.end（））; } void locate_duplicates（） { using std :: adjacent_find; typedef vector< Type> :: iterator ITR; typedef vector< Type> :: value_type VALUE; typedef boost :: tuple< VALUE，ITR，ITR>元组; typedef vector< TUPLE> V_TUPLE; V_TUPLE结果; ITR itr_begin（m_container.begin（））; ITR itr_end（m_container.end（））; ITR itr（m_container.begin（））; ITR itr_range_begin（m_container.begin（））; while（itr_begin！= itr_end） { //找到一个相等的开头 itr = adjacent_find（ itr_begin， itr_end， []（VALUE& v1，VALUE& v2） { return v1 == v2; } ）; if（itr_end == itr）break; //容器结束 //找到一个相等的结尾 VALUE start = * itr; while（itr！= itr_end） { if（！（* itr == start））break; itr ++; } results.push_back（TUPLE（start，itr_range_begin，itr））; //准备下一个迭代 itr_begin = itr; } } }; struct List：Test< list< Type> > { 列表（bool sorted）：m_sorted（sorted）{} string描述（） { return m_sorted？ std :: list - > sort（） - > shrink list：std :: list - > shrink list; } private： void sort_it（） { if（m_sorted）m_container.sort（）; //// std :: sort（m_container.begin （），m_container.end（））; } void locate_duplicates（） { typedef list< Type> :: value_type VALUE; typedef list< Type> :: iterator ITR; typedef vector< VALUE>组; typedef向量< GROUP>组; GROUPS sub_groups; GROUP one_group; while（m_container.size（）> 1） { VALUE front（m_container.front（））; m_container.pop_front（）; ITR it = m_container.begin（）; ITR it_end = m_container.end（）; while（it！= it_end） { if（front ==（* it）） { one_group.push_back（* it）; //单独出来 it = m_container.erase（it）; // shrink list by one } else { it ++; } } //保存结果如果（！one_group.empty（）） { // save one_group.push_back（front）; sub_groups.push_back（one_group）; //清除，内存分配未释放 one_group.clear（）; } } } private： bool m_sorted; }; struct Map：Test< vector< Type>> { string Description（） { returnstd :: vector - > std :: map< Type，vector< Type>> ; } private： void sort_it（）{} void locate_duplicates（） { typedef map<类型，向量& ; >地图; typedef MAP :: iterator ITR; MAP local_map; BOOST_FOREACH（const vector< Type> :: value_type& v，m_container） { pair< ITR，bool>麻省理工学院; mit = local_map.insert（make_pair（v，vector< Type>（1，v）））; if（！mit.second）（mit.first-> second）.push_back（v）; } ITR itr（local_map.begin（））; while（itr！= local_map.end（）） { if（itr-> second.empty（））local_map.erase（itr）; itr ++; } } }; struct Multiset：Test< vector< Type>> { string描述（） { 返回std :: vector - > std :: multiset< Type> - > adjacent_find（）; } private： void sort_it（）{} void locate_duplicates（） { using std :: adjacent_find; typedef set< Type>组; typedef SET :: iterator ITR; typedef SET :: value_type VALUE; typedef boost :: tuple< VALUE，ITR，ITR>元组; typedef vector< TUPLE> V_TUPLE; V_TUPLE结果; SET local_set; BOOST_FOREACH（const vector< Type> :: value_type& v，m_container） { local_set.insert（v）; } ITR itr_begin（local_set.begin（））; ITR itr_end（local_set.end（））; ITR itr（local_set.begin（））; ITR itr_range_begin（local_set.begin（））; while（itr_begin！= itr_end） { //找到一个相等的开头 itr = adjacent_find（ itr_begin， itr_end， []（VALUE& v1，VALUE& v2） { return v1 == v2; } ）; if（itr_end == itr）break; //容器结束 //找到一个相等的结尾 VALUE start = * itr; while（itr！= itr_end） { if（！（* itr == start））break; itr ++; } results.push_back（TUPLE（start，itr_range_begin，itr））; //准备下一个迭代 itr_begin = itr; } } }; int main（） { size_t N = 20; Vector（）。run（20）; 列表（true）.run（20）; 列表（false）.run（20）; Map（）。run（20）; Multiset（）。run（20）; }

解决方案

其他元素的列表/向量/ deque的代表元素。这需要在插入容器中相对较少的比较，意味着您可以遍历所得到的组而不必执行任何比较。

此示例始终插入第一个代表元素进入映射的deque存储，因为它使得后续的迭代通过组在逻辑上很简单，但是如果这个重复证明是一个问题，那么只需执行 push_back 只有 if（！ins_pair.second）。

  typedef std :: map< Type ，std :: deque< Type> >存储; 
 
 void插入（存储& s，const类型&）
 {
 std :: pair< Storage :: iterator，bool> ins_pair（s.insert（std :: make_pair（t，std :: deque< Type>（））））; 
 ins_pair.first-> second.push_back（t）; 
}

然后（相对）简单和便宜地迭代组：

  void Iterate（const Storage& s）
 {
 for（Storage :: const_iterator i = s.begin（ ）; i！= s.end（）; ++ i）
 {
 for（std :: deque< Type> :: const_iterator j = i-> second.begin（）; j！ = i-> second.end（）; ++ j）
 {
 //做某事与* j 
} 
} 
} 
我进行了一些实验比较和对象计数。在测试中，随机订单中有100000个对象组成50000个组（即平均每组2个对象），上述方法花费以下数量的比较和副本：
  1630674比较，443290份
  
（我尝试带上副本但是仅仅是以牺牲比较为代价，这似乎是您的场景中较高的成本操作。）
 
 
 使用多重映射，并将之前的元素保留在检测组转换的最终迭代成本为：
  1756208比较，100000份
  
使用单个列表并弹出前端元素并对其他群组成员执行线性搜索cost：
  1885879088比较，100000份
  
是的， 〜1.9b比较相对于我最好的方法〜1.6m。为了使列表方法在最佳比较数量附近执行任何操作，它必须进行排序，并且首先需要构建类似数量的比较来构建固有顺序的容器。
 
 
  编辑 
 
 
 我把你的贴子代码运行了隐含的算法（我不得不对代码与某些假定的定义一样）在与之前使用的相同的测试数据集中，我计数：
  1885879088比较，420939副本
  
 ie与我的愚蠢列表算法完全相同的比较数量，但更多的副本。我认为这意味着我们在这种情况下使用基本相似的算法。我看不到有其他排序顺序的证据，但是看起来你想要一个包含多个等效元素的组列表。这可以通过在中添加if（i-> size> 1）子句在 Iterate 函数中简单地实现
 
 
 我仍然看不到任何证据表明构建一个排序的容器，如这个deques的映射不是一个好的（即使不是最佳的）策略。 p> 
Say you have a collection of elements,  how can you pick out those with duplicates and put them into each group with least amount of comparison? preferably in C++,  but algorithm is more important than the language.
For Example
given {E1,E2,E3,E4,E4,E2,E6,E4,E3}, I wish to extract out {E2,E2}, {E3,E3}, {E4,E4,E4}.
what data structure and algorithm you will choose? Please also include the cost of setting up the data structure, say, if it's a pre-sorted one like std::multimap

Updates

To make things clearer as suggested. there's one constraint: the elements must be compared by themselves to be certain they are duplicates. 

So hashes do not apply, because virtually they shift the comparison to from heavy elements(e.g. chunks of data) to light elements(integers), and reduce some comparison, but not do away with them, and in the end, we are back to our original problem, when are inside one collision bucket. 

Pretend you have a bunch of potentials duplicate files of GBs each, they bear the same hash value by every hash-algorithm human beings know. Now you are going to spot the real duplicates.

No, it can't be a real-life problem(even MD5 is enough to generate unique hash for real-life files). But just pretend so that we can focus on finding the data structure + algorithm that involves least amount of comparison.



What I am doing is to 

represent into a STL std::list data structure(in that 1) its element-deletion is cheaper than,   say, a vector 2) its insertion is cheaper, not requiring sort.)
pop out one element and compare it with the rest, if a duplicate is found, it's pulled out of the list. once the end of the list is reached, one group of duplication is found, if any.
repeat the above 2  steps until the list is empty.
It needs N-1 in the best case, but (N-1)! in the worse case.

what are the better alternatives?



My code using method explained above:
// algorithm to consume the std::list container,
// supports: list<path_type>,list< pair<std::string, paths_type::const_iterater>>
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrinked list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            list<T>::iterator it     = l.begin(); 
            list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool  result;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 


};




Thanks for the answer below, however they seem to be misled by my example that it's about integers. In fact the elements are type agnostic(not necessarily integers, strings or any other PODs),  and the equal predicates are self-defined,  that is the comparison can be very heavy. 

So maybe my question should be: using which data structure + algorithm involves fewer comparisons.

Using a pre-sorted container like multiset, multimap is not better according to my test, since 

the sorting while inserting already does the comparisons,
the following adjacent finding does comparison again, 
these data structure prefer less-than operations to equal operations, they perform 2 less-than(a
I do not see how it can save comparisons. 



one more thing that's ignored by some answers below, I need to differentiate the duplicate groups from one another, not just keep them in the container. 



Conclusion

After all the discussion,  there seem to be 3 ways

my original naive method as explained above
Start with a linear container like std::vector , sort it and then locate the equal ranges
start with an associated container like std::map<Type, vector<duplicates>>, pick out the duplicates during the setup of associated container as suggested by Charles Bailey.
I've coded a sample to test all the methods as posted below. 

the number of duplicates and when they are distributed may influence the best choice.  


Method 1 is best when they fall heavily at the front, and is worst when at the end. Sort will not change the distribution, but the endian.
Method 3 has the most average performance
Method 2 is never the best choice


Thanks for all who participating in the discussion.

one output with 20 sample items from the code below.

  Test with   [ 20 10 6 5 4 3 2 2 2 2 1
  1 1 1 1 1 1 1 1 1 ]
  
  and   [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3
  4 5 6 10 20 ] respectively
  
  using std::vector -> sort() ->
  adjacent_find():
  
  comparisons: [ '<'  = 139,  '==' = 23
  ]
  
  comparisons: [ '<'  = 38, '==' = 23 ]
  
  using std::list -> sort() -> shrink
  list:
  
  comparisons: [ '<'  = 50, '==' = 43 ]
  
  comparisons: [ '<'  = 52, '==' = 43 ]
  
  using std::list -> shrink list:
  
  comparisons: [ '<'  = 0, '==' = 121 ]
  
  comparisons: [ '<'  = 0, '==' = 43 ]
  
  using std::vector -> std::map>:
  
  comparisons: [ '<'  = 79, '==' = 0 ]
  
  comparisons: [ '<'  = 53, '==' = 0 ]
  
  using std::vector ->
  std::multiset ->
  adjacent_find():
  
  comparisons: [ '<'  = 79, '==' = 7 ]
  
  comparisons: [ '<'  = 53, '==' = 7 ]
  
  Code


// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        bool reserve_order = false;

        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method%2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
                []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort();////std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.empty()) local_map.erase(itr);

            itr++;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef set<Type> SET;
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
            []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    size_t N = 20;

    Vector().run(20);
    List(true).run(20);
    List(false).run(20);
    Map().run(20);
    Multiset().run(20);
}

 解决方案 
You could use a map from a representative element to a list/vector/deque of other elements. This requires relatively fewer comparisons in insertion into the container and means that you can iterate through the resulting groups without having to perform any comparisons.

This example always inserts the first representative element into the mapped deque storage as it makes the subsequent iteration through the group logically simple, but if this duplication proves a problem then it would be easy to only perform the push_back only if (!ins_pair.second).
typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}
Iterating through the groups is then (relatively) simple and cheap:
void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}
I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. and average of 2 objects per group) the above method cost the following number of comparisons and copies:
1630674 comparisons, 443290 copies
(I tried bringing the number of copies down, but only really managed to at the expense of comparisons which seem to be the higher cost operation in your scenario.)

Using a multimap, and retaining the previous element in the final iteration to detect the group transitions cost this:
1756208 comparisons, 100000 copies
Using a single list and popping the front element and performing a linear search for other group members cost:
1885879088 comparisons, 100000 copies
Yes, that's ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted and this is going to cost a similar number of comparisons as building an inherently ordered container would in the first place.

Edit

I took your posted code and ran the implied algorithm (I had to make some assumptions about the code as there as some assumed definitions) over the same test data set as I used before and I counted:
1885879088 comparisons, 420939 copies
i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that that means we using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of the groups which contain more than one equivalent elements. This can be simply achieved in my Iterate function by adding in if (i->size > 1) clause.

I still can't see any evidence that building a sorted container such as this map of deques isn't a good (even if not optimal) strategy.

                        这篇关于什么是快速算法在集合中查找重复元素并将其分组？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

什么是快速算法在集合中查找重复元素并将其分组？ [英] what are the fast algorithms to find duplicate elements in a collection and group them?

问题描述

更新

结论

感谢所有参与讨论的人。

代码

Updates

Conclusion

Thanks for all who participating in the discussion.

Code

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

什么是快速算法在集合中查找重复元素并将其分组？ [英] what are the fast algorithms to find duplicate elements in a collection and group them?

问题描述

更新

结论

感谢所有参与讨论的人。

代码

Updates

Conclusion

Thanks for all who participating in the discussion.

Code

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭