什么是快速算法找到一个收集和他们组重复的元素？ [英] what are the fast algorithms to find duplicate elements in a collection and group them?

查看：147 发布时间：2015/11/30 14:19:30 c++ algorithm elements duplicate-data

本文介绍了什么是快速算法找到一个收集和他们组重复的元素？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

假设你有元素的集合，你怎么能挑选出那些有重复，并付诸每组最少的比较？ pferably用C ++ $ P $，但算法比语言更重要。例如给出{E1，E2，E3，E4，E4，E2，E6，E4，E3}，我希望提取出{E2，E2}，{E3，E3}，{E4，E4，E4}。什么样的数据结构和算法，你会选谁？还请包括建立数据结构的成本，比如说，如果它是一个pre排序的人喜欢的std :: multimap中

更新

为了让事情更清晰的建议。有一个约束：元素必须自己进行比较，以肯定他们是重复的。

所以哈希值不适用，因为实际上，他们从重元素（数据如块），以轻元素（整数）的比较转向，并减少了一些比较，而不是与他们无关了，并且在最后，我们又回到我们原来的问题，当其内部一次碰撞桶。

pretend你有一堆潜力的重复GB的文件，每个文件，他们承担相同的散列值由每个哈希算法的人知道。现在，你会发现真正的重复。

没有，它不能成为一个现实生活中的问题（甚至MD5是足以产生唯一哈希用于现实生活中的文件）。但是，仅仅pretend使我们能够重点查找数据结构+算法，涉及最少的比较

我在做什么是

再present成STL的std ::列表的数据结构（即1）其元素的缺失是便宜，说，向量2）它的插入比较便宜，不需要排序。）
蹦出一种元素，并与其余的比较，如果一个重复的被发现，它的拉出列表。一旦该列表的末尾为止，一组复制的是发现，如果有的话。
重复上述2个步骤，直到列表为空。

有需要的N-1在最好的情况下，但（N-1）！在最坏的情况下

有什么更好的方法？

我的code。使用方法上面解释的：

  //算法消耗的std :: list容器，
//支持：列表＆LT; path_type＆GT;，名单，其中，对＆LT;的std ::字符串，paths_type :: const_iterater＆GT;＆GT;
模板＆LT;类T＆GT;
结构consume_list
{
    groups_type运算符（）（表＆LT; T＆GT;＆安培; L）
    {
        //删除虚假identicals和组剩下的
        // 算法：
        第一元件// 1.比较与剩余的元件，
        //挑出所有重复的文件，包括第一个元素本身。
        // 2。重新开始与收缩列表
        //直到列表中包含一个或零个元素。

        groups_type sub_groups;
        group_type one_group;
        one_group.reserve（1024）;

        而（l.size（）→1）
        {
            牛逼前（l.front（））;
            l.pop_front（）;

            ITEM_ predicate＆LT; T＆GT; EP（前）;
            名单＆LT; T＆GT;：迭代它= l.begin（）;
            名单＆LT; T＆GT;：迭代it_end = l.end（）;
            而（它！= it_end）
            {
                如果（ep.equals（*吧））
                {
                    one_group.push_back（ep.extract_path（*（它）））; //挑选出来
                    它= l.erase（它）;
                }
                其他
                {
                    它++;
                }
            }

            //保存结果
            如果（！one_group.empty（））
            {
                // 保存
                one_group.push_back（ep.extract_path（前））;
                sub_groups.push_back（one_group）;

                //清晰，内存分配没有释放
                one_group.clear（）;
            }
        }
        返回sub_groups;
    }
};


//中的STL容器类型的项目，项目的比较，如的std ::名单
模板＆LT;类T＆GT;
结构ITEM_ predicate {};

//专业化类型path_type
模板＆LT;＆GT;
结构ITEM_ predicate＆LT; path_type＆GT;
{
上市：
    ITEM_ predicate（常量path_type和放大器;基地）/ *初始化列表* /
    {}
上市：
    布尔等于（常量path_type和放大器; comparee）
    {
        布尔结果;
        / *耗时这里操作* /
        返回结果;
    }

    常量path_type和放大器; extract_path（常量path_type和放大器; P）
    {
        返回磷;
    }
私人：
    //类成员
};


};

感谢下面的答案，但是他们似乎被我的例子，它是关于整数被误导。事实上元素的类型无关的（不一定是整数，字符串或任何其他的POD），并平等predicates是自定义的，即的比较可以很重。

所以，也许我的问题应该是：在使用该数据结构+算法涉及较少的比较

。

像使用一个多集pre排序的容器，multimap中是不是更好，根据我的测试，因为

进行排序，而插入已经做了比较，
在下面的相邻发现确实比较重，
在这些数据结构preFER小于操作等于操作，他们执行2小于（一

我看不出它如何能保存的比较。

这年代由以下一些答案忽略了一件事，我需要相互区分重复的群体，不只是让他们在容器中。

结论

在所有的讨论中，似乎有3种方式

在我原来的天真上述方法
在开始一个像的std ::矢量，排序，然后找到相同范围的线性容器
在使用类似性病相关容器::地图＆LT;类型，矢量＆lt;重复＆GT;＆GT; ，相关容器的安装过程中挑选出的复印件，建议由查尔斯贝利。

我有codeD的样品为下面贴测试的所有方法。

副本的数目和当它们分布可能影响的最佳选择。

方法1是最好的，当它们落入重金在前面，并且是最严重的，当在末端。排序不会改变的分布，但端。
方法3拥有最平均业绩
2的方法绝不是最好的选择

感谢所有参与讨论谁。

一个输出与下面的code 20样物品。

与[20 10 6 5 4 3 2 2 2 2 1试验   1 1 1 1 1 1 1 1 1]

和[1 1 1 1 1 1 1 1 1 1 2 2 2 2 3   4 5 6 10 20]分别

使用std ::向量 - >排序（） - >   adjacent_find（）：

比较：['＆LT; = 139，'=='= 23   ]

比较：['＆LT; = 38，'=='= 23]

使用std ::列表 - >排序（） - >收缩   清单：

比较：['＆LT; = 50，'=='= 43]

比较：['＆LT; = 52，'=='= 43]

使用std ::列表 - >收缩清单：

比较：['＆LT; = 0，'=='= 121]

比较：['＆LT; = 0，'=='= 43]

使用std ::向量 - >的std ::地图>：

比较：['＆LT; = 79，'=='= 0]

比较：['＆LT; = 53，'=='= 0]

使用std ::向量 - >   性病:: multiset的 - >   adjacent_find（）：

比较：['＆LT; = 79，'=='= 7]

比较：['＆LT; = 53，'=='= 7]

code

  //用VC ++编译10：cl.exe时/ EHSC

＃包括＆LT;载体＆GT;
＃包括＆LT;双端队列＆GT;
＃包括＆LT;列表＆gt;
＃包括＆LT;地图＆GT;
＃包括＆LT;集＆gt;
＃包括＆LT;算法＆GT;
＃包括＆LT;的iostream＆GT;
＃包括＆LT; sstream＆GT;

＃包括＆LT;升压/ foreach.hpp＆GT;
＃包括＆LT;升压/元组/ tuple.hpp＆GT;
＃包括＆LT;升压/ format.hpp＆GT;

使用名字空间std;

结构类型
{
    类型（int i）以：m_i（ⅰ）{}

    布尔运算符＆LT;（常量类型＆amp; T公司）常量
    {
        ++ number_less_than_comparison;
        返回m_i＆LT; t.m_i;
    }

    布尔运算符==（const型＆amp; T公司）常量
    {
        ++ number_equal_comparison;
        返回m_i == t.m_i;
    }
上市：
    静态无效的日志（常量字符串和放大器;操作）
    {
        COUT
        ＆LT;＆LT; ＆LT中比较;＆lt;操作＆LT;＆LT; ：
        ＆LT;＆LT; '＆LT; =＆其中;＆其中; number_less_than_comparison
        ＆LT;＆LT; ，
        ＆LT;＆LT; '=='=＆LT;＆LT; number_equal_comparison
        ＆LT;＆LT; ] \ N的;

        复位（）;
    }

    INT to_int（）const的
    {
        返回m_i;
    }
私人：
    静态无效复位（）
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;
    }

上市：
    静态为size_t number_less_than_comparison;
    静态为size_t number_equal_comparison;
私人：
    诠释m_i;
};

为size_t类型:: number_less_than_comparison = 0;
为size_t类型:: number_equal_comparison = 0;

ostream的＆放大器;运营商的LT;＆LT;（ostream的＆放大器; OS，const型电话电报公司）
{
    OS＆LT;＆LT; t.to_int（）;
    返回操作系统;
}

模板＆LT;一流的集装箱＆GT;
结构测试
{
    无效recursive_run（为size_t N）
    {
        布尔reserve_order = FALSE;

        用于（为size_t i = 48;我n种; ++ I）
        {
            运行（我）;
        }
    }

    无效的run（我的size_t）
    {
        COUT＆LT;＆LT;
        提高::格式（\ñ\ NTEST％1％样本的元素\ n使用方法％2％：\ N）
        ％ 一世
        ％ 描述（）;

        generate_sample（ⅰ）;
        分类（）;
        定位（）;

        generate_reverse_sample（ⅰ）;
        分类（）;
        定位（）;
    }

私人：
    无效print_me（常量字符串和放大器;当）
    {
        性病:: stringstream的SS;
        SS＆LT;＆LT;当＆LT;＆LT;= [;
        BOOST_FOREACH（常量集装箱:: value_type的＆安培; V，m_container）
        {
            SS＆LT;＆LT; V＆LT;＆LT; ;
        }
        SS＆LT;＆LT; ] \ N的;
        COUT＆LT;＆LT; ss.str（）;
    }

    无效generate_sample（为size_t N）
    {
        m_container.clear（）;
        用于（为size_t i = 1; I＆LT; = N; ++ I）
        {
            m_container.push_back（类型（N / I））;
        }
        print_me（初始化值）;
        类型::日志（设置）;
    }

    无效generate_reverse_sample（为size_t N）
    {
        m_container.clear（）;
        为（为size_t I = 0; I＆n种++ⅰ）
        {
            m_container.push_back（类型（N /（N-1）））;
        }
        print_me（初始化值（倒序））;
        类型::日志（设置）;
    }

    无效的sort（）
    {
        sort_it（）;

        类型::日志（排序）;
        print_me（后的排序）;

    }

    无效定位（）
    {
        locate_duplicates（）;

        类型::日志（查找重复）;
    }
保护：
    虚拟串说明（）= 0;
    虚拟无效sort_it（）= 0;
    虚拟无效locate_duplicates（）= 0;
保护：
    集装箱m_container;
};

结构向量：测试＆LT;矢量＆lt;类型和GT; ＆GT;
{
    字符串说明（）
    {
        返回的std ::矢量＆lt;类型和GT;  - ＆GT;的sort（） - ＆GT; adjacent_find（）;
    }

私人：
    无效sort_it（）
    {
        性病::排序（m_container.begin（），m_container.end（））;
    }

    无效locate_duplicates（）
    {
        使用std :: adjacent_find;
        类型定义矢量＆lt;类型和GT;：迭代ITR;
        类型定义矢量＆lt;类型和GT; :: value_type的价值;

        TYPEDEF提振::元组LT;价值，ITR，ITR＆GT;元组;
        的typedef矢量＆lt;元组GT; V_TUPLE;

        V_TUPLE结果;

        ITR itr_begin（m_container.begin（））;
        ITR itr_end（m_container.end（））;
        ITR ITR（m_container.begin（））;
        ITR itr_range_begin（m_container.begin（））;

        而（itr_begin！= itr_end）
        {
            //找到一个平等reange开始
            ITR = adjacent_find（
            itr_begin，
            itr_end，
                []（VALUE＆安培; V1，价值和放大器; V2）
                {
                    返回V1 == v2的;
                }
            ）;
            如果（itr_end == ITR）破; //容器结束

            //找到一个平等reange结束
            开始值= * ITR;
            而（ITR！= itr_end）
            {
                如果突破（（* ITR ==开始）！）;
                ITR ++;
            }

            results.push_back（数组（启动，itr_range_begin，ITR））;

            // prepare的下一个迭代
            itr_begin = ITR;
        }
    }
};

结构清单：测试与LT;列表＆LT;类型＆GT; ＆GT;
{
    列表（布尔排序）：m_sorted（排序）{}

    字符串说明（）
    {
        返回m_sorted？ 的std ::列表 - ＆GT;的sort（） - ＆GT;缩水榜：标准::列表 - ＆GT;缩水榜;
    }
私人：
    无效sort_it（）
    {
        如果（m_sorted）m_container.sort（）; ////的std ::排序（m_container.begin（），m_container.end（））;
    }

    无效locate_duplicates（）
    {
        类型定义列表＆LT;类型＆GT; :: value_type的价值;
        类型定义列表＆LT;类型＆GT;：迭代ITR;

        类型定义矢量＆lt; VALUE＆GT;组;
        的typedef矢量＆lt;组＆gt;组;

        组sub_groups;
        集团one_group;

        而（m_container.size（）→1）
        {
            前值（m_container.front（））;
            m_container.pop_front（）;

            ITR它= m_container.begin（）;
            ITR it_end = m_container.end（）;
            而（它！= it_end）
            {
                如果（前==（*吧））
                {
                    one_group.push_back（*吧）; //挑选出来
                    它= m_container.erase（它）; //一个缩小清单
                }
                其他
                {
                    它++;
                }
            }

            //保存结果
            如果（！one_group.empty（））
            {
                // 保存
                one_group.push_back（前）;
                sub_groups.push_back（one_group）;

                //清晰，内存分配没有释放
                one_group.clear（）;
            }
        }
    }

私人：
    布尔m_sorted;
};

结构图：测试＆LT;矢量＆lt;类型和GT;＆GT;
{
    字符串说明（）
    {
        返回的std ::矢量 - ＆GT;的std ::地图＆LT;类型，矢量＆lt;类型和GT;＆gt;中;
    }
私人：
    空隙sort_it（）{}

    无效locate_duplicates（）
    {
        类型定义地图＆LT;类型，矢量＆lt;类型和GT; ＆GT;地图;
        类型定义地图：迭代ITR;

        MAP local_map;

        BOOST_FOREACH（常量矢量＆lt;类型和GT; :: value_type的＆安培; V，m_container）
        {
            对＆LT; ITR，布尔＆GT;麻省理工学院;
            MIT = local_map.insert（make_pair（ⅴ，矢量＆lt;类型＆GT;（1，V）））;
            如果（mit.second！）（mit.first-＆gt;第二个）.push_back（V）;
         }

        ITR ITR（local_map.begin（））;
        而（ITR！= local_map.end（））
        {
            如果（itr-＆GT; second.empty（））local_map.erase（ITR）;

            ITR ++;
        }
    }
};

结构多集：测试＆LT;矢量＆lt;类型和GT;＆GT;
{
    字符串说明（）
    {
        返回的std ::矢量 - ＆GT;的std :: multiset的＆LT;类型＆GT;  - ＆GT; adjacent_find（）;
    }
私人：
    空隙sort_it（）{}

    无效locate_duplicates（）
    {
        使用std :: adjacent_find;

        类型定义设置＆LT;类型＆GT;组;
        的typedef SET：迭代ITR;
        的typedef SET :: value_type的价值;

        TYPEDEF提振::元组LT;价值，ITR，ITR＆GT;元组;
        的typedef矢量＆lt;元组GT; V_TUPLE;

        V_TUPLE结果;

        SET local_set;
        BOOST_FOREACH（常量矢量＆lt;类型和GT; :: value_type的＆安培; V，m_container）
        {
            local_set.insert（五）;
        }

        ITR itr_begin（local_set.begin（））;
        ITR itr_end（local_set.end（））;
        ITR ITR（local_set.begin（））;
        ITR itr_range_begin（local_set.begin（））;

        而（itr_begin！= itr_end）
        {
            //找到一个平等reange开始
            ITR = adjacent_find（
            itr_begin，
            itr_end，
            []（VALUE＆安培; V1，价值和放大器; V2）
                {
                    返回V1 == v2的;
                }
            ）;
            如果（itr_end == ITR）破; //容器结束

            //找到一个平等reange结束
            开始值= * ITR;
            而（ITR！= itr_end）
            {
                如果突破（（* ITR ==开始）！）;
                ITR ++;
            }

            results.push_back（数组（启动，itr_range_begin，ITR））;

            // prepare的下一个迭代
            itr_begin = ITR;
        }
    }
};

诠释的main（）
{
    为size_t N = 20;

    向量（）运行（20）。
    列表（真）。运行（20）;
    列表（假）。运行（20）;
    。图（）运行（20）;
    多集（）运行（20）。
}

解决方案

您可以使用来自重presentative元素映射到其他元素的列表/矢量/双端队列。这需要相对较少的插入比较入容器并意味着可以通过将得到的基团迭代而无需执行任何比较

这个例子总是插入首重presentative元到映射的双端队列存储，因为它使通过该组的后续迭代逻辑简单，但如果这样的重复证明了一个问题，那么这将是很容易只执行的push_back 仅如果（！ins_pair.second）。

 的typedef的std ::地图＆LT;类型，的std :: deque的＆LT;类型＆GT; ＆GT;存储;

无效插入（存储和放大器; S，常量类型电话电报公司）
{
    的std ::对＆LT;储存：迭代，布尔＆GT; ins_pair（s.insert（的std :: make_pair（T，的std :: deque的＆LT;类型＆GT;（））））;
    ins_pair.first-＆GT; second.push_back（T）;
}

通过迭代组是那么的（相对）简单和便宜的：

 无效迭代（常量存储和放大器; S）
{
    为（库存::为const_iterator I = s.begin（！）; I = s.end（）; ++ I）
    {
        对于（的std :: deque的＆LT;类型＆GT; ::为const_iterator J = I-＆GT; second.begin（）; J = I-＆GT;！second.end（）; ++ j）条
        {
            //做点什么*Ĵ
        }
    }
}

我进行了一些实验进行比较和对象计数。在与以随机顺序形成50000组（即平均每2组对象）100000对象测试上述方法成本以下一些比较和副本：

  1630674对比，443290份

（我想使拷贝数下降，但只有真正设法在比较的费用，这似乎是在方案中成本较高的操作。）

使用多重映射，并保留在最后一次迭代的previous元素检测组的转换费用如下：

  1756208比较，10万册

使用一个单独的列表，并弹出前元素并执行其他组成员线性搜索成本：

  1885879088比较，10万册

是的，这是〜相比〜1.6米为我的最好的方法19亿的比较。要获取列表的方法来进行比较近，将不得不进行排序的最佳数目的任何地方，这是要花费比较的建设将摆在首位的固有有序容器同样数量的。

修改

我把你贴code和运行隐含的算法（我不得不做出对code一些假设，因为一些假定的定义）在相同的测试数据集，因为我以前用过，我算：

  1885879088比较，420939份

即。相同数量的比较我的哑巴名单的算法，但更多的副本。我认为，我们使用基本相同的算法，这种情况下的意思。我看不到的替代排序顺序的任何证据，但它看起来像你想它包含多个等价元素组的列表。这可以在我的迭代功能在添加简单的实现，如果（I-＆GT;尺寸大于1）条款

我现在还不能看到任何证据表明，建立一个有序的容器，如本地图双端的是不是一个很好的（即使不是最优的）战略。

Say you have a collection of elements, how can you pick out those with duplicates and put them into each group with least amount of comparison? preferably in C++, but algorithm is more important than the language. For Example given {E1,E2,E3,E4,E4,E2,E6,E4,E3}, I wish to extract out {E2,E2}, {E3,E3}, {E4,E4,E4}. what data structure and algorithm you will choose? Please also include the cost of setting up the data structure, say, if it's a pre-sorted one like std::multimap

Updates

To make things clearer as suggested. there's one constraint: the elements must be compared by themselves to be certain they are duplicates.

So hashes do not apply, because virtually they shift the comparison to from heavy elements(e.g. chunks of data) to light elements(integers), and reduce some comparison, but not do away with them, and in the end, we are back to our original problem, when are inside one collision bucket.

Pretend you have a bunch of potentials duplicate files of GBs each, they bear the same hash value by every hash-algorithm human beings know. Now you are going to spot the real duplicates.

No, it can't be a real-life problem(even MD5 is enough to generate unique hash for real-life files). But just pretend so that we can focus on finding the data structure + algorithm that involves least amount of comparison.

What I am doing is to

represent into a STL std::list data structure(in that 1) its element-deletion is cheaper than, say, a vector 2) its insertion is cheaper, not requiring sort.)
pop out one element and compare it with the rest, if a duplicate is found, it's pulled out of the list. once the end of the list is reached, one group of duplication is found, if any.
repeat the above 2 steps until the list is empty.

It needs N-1 in the best case, but (N-1)! in the worse case.

what are the better alternatives?

My code using method explained above:

// algorithm to consume the std::list container,
// supports: list<path_type>,list< pair<std::string, paths_type::const_iterater>>
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrinked list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            list<T>::iterator it     = l.begin(); 
            list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool  result;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 


};

Thanks for the answer below, however they seem to be misled by my example that it's about integers. In fact the elements are type agnostic(not necessarily integers, strings or any other PODs), and the equal predicates are self-defined, that is the comparison can be very heavy.

So maybe my question should be: using which data structure + algorithm involves fewer comparisons.

Using a pre-sorted container like multiset, multimap is not better according to my test, since

the sorting while inserting already does the comparisons,
the following adjacent finding does comparison again,
these data structure prefer less-than operations to equal operations, they perform 2 less-than(a

I do not see how it can save comparisons.

one more thing that's ignored by some answers below, I need to differentiate the duplicate groups from one another, not just keep them in the container.

Conclusion

After all the discussion, there seem to be 3 ways

my original naive method as explained above
Start with a linear container like std::vector , sort it and then locate the equal ranges
start with an associated container like std::map<Type, vector<duplicates>>, pick out the duplicates during the setup of associated container as suggested by Charles Bailey.

I've coded a sample to test all the methods as posted below.

the number of duplicates and when they are distributed may influence the best choice.

Method 1 is best when they fall heavily at the front, and is worst when at the end. Sort will not change the distribution, but the endian.
Method 3 has the most average performance
Method 2 is never the best choice

Thanks for all who participating in the discussion.

one output with 20 sample items from the code below.

Test with [ 20 10 6 5 4 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 ]

and [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 4 5 6 10 20 ] respectively

using std::vector -> sort() -> adjacent_find():

comparisons: [ '<' = 139, '==' = 23 ]

comparisons: [ '<' = 38, '==' = 23 ]

using std::list -> sort() -> shrink list:

comparisons: [ '<' = 50, '==' = 43 ]

comparisons: [ '<' = 52, '==' = 43 ]

using std::list -> shrink list:

comparisons: [ '<' = 0, '==' = 121 ]

comparisons: [ '<' = 0, '==' = 43 ]

using std::vector -> std::map>:

comparisons: [ '<' = 79, '==' = 0 ]

comparisons: [ '<' = 53, '==' = 0 ]

using std::vector -> std::multiset -> adjacent_find():

comparisons: [ '<' = 79, '==' = 7 ]

comparisons: [ '<' = 53, '==' = 7 ]

Code

// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 
        bool reserve_order = false;

        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method%2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
                []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort();////std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.empty()) local_map.erase(itr);

            itr++;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef set<Type> SET;
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find  the start of one equal reange
            itr = adjacent_find(
            itr_begin, 
            itr_end, 
            []  (VALUE& v1, VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            // find  the end of one equal reange
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                itr++;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    size_t N = 20;

    Vector().run(20);
    List(true).run(20);
    List(false).run(20);
    Map().run(20);
    Multiset().run(20);
}

解决方案

You could use a map from a representative element to a list/vector/deque of other elements. This requires relatively fewer comparisons in insertion into the container and means that you can iterate through the resulting groups without having to perform any comparisons.

This example always inserts the first representative element into the mapped deque storage as it makes the subsequent iteration through the group logically simple, but if this duplication proves a problem then it would be easy to only perform the push_back only if (!ins_pair.second).

typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}

Iterating through the groups is then (relatively) simple and cheap:

void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}

I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. and average of 2 objects per group) the above method cost the following number of comparisons and copies:

1630674 comparisons, 443290 copies

(I tried bringing the number of copies down, but only really managed to at the expense of comparisons which seem to be the higher cost operation in your scenario.)

Using a multimap, and retaining the previous element in the final iteration to detect the group transitions cost this:

1756208 comparisons, 100000 copies

Using a single list and popping the front element and performing a linear search for other group members cost:

1885879088 comparisons, 100000 copies

Yes, that's ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted and this is going to cost a similar number of comparisons as building an inherently ordered container would in the first place.

Edit

I took your posted code and ran the implied algorithm (I had to make some assumptions about the code as there as some assumed definitions) over the same test data set as I used before and I counted:

1885879088 comparisons, 420939 copies

i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that that means we using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of the groups which contain more than one equivalent elements. This can be simply achieved in my Iterate function by adding in if (i->size > 1) clause.

I still can't see any evidence that building a sorted container such as this map of deques isn't a good (even if not optimal) strategy.

这篇关于什么是快速算法找到一个收集和他们组重复的元素？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

什么是快速算法找到一个收集和他们组重复的元素？ [英] what are the fast algorithms to find duplicate elements in a collection and group them?

问题描述

更新

结论

感谢所有参与讨论谁。

code

Updates

Conclusion

Thanks for all who participating in the discussion.

Code

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

什么是快速算法找到一个收集和他们组重复的元素？ [英] what are the fast algorithms to find duplicate elements in a collection and group them?

问题描述

更新

结论

感谢所有参与讨论谁。

code

Updates

Conclusion

Thanks for all who participating in the discussion.

Code

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭