Faster way to count number of sets an item appears in?

Question

I've got a list of bookmarks. Each bookmark has a list of keywords (stored as a HashSet). I also have a set of all possible keywords ("universe").

I want to find the keyword that appears in the most bookmarks.

I have 1356 bookmarks with a combined total of 698,539 keywords, of which 187,358 are unique.

If I iterate through every keyword in the universe and count the number of bookmarks it appears in, I'm doing 254,057,448 checks. This takes 35 seconds on my machine.

The algorithm is pretty simple:

var biggest = universe.MaxBy(kw => bookmarks.Count(bm => bm.Keywords.Contains(kw)));

Using Jon Skeet's MaxBy.
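For readers who don't have that helper to hand, here is a minimal sketch of what a MaxBy-style extension method can look like. This is not Jon Skeet's actual code; the class name `MaxByExtensions` and the method name `MaxByKey` are made up for illustration (the different name also avoids clashing with the `Enumerable.MaxBy` that ships with .NET 6 and later).

```csharp
using System;
using System.Collections.Generic;

public static class MaxByExtensions
{
    // Hypothetical helper: returns the element whose projected key is largest.
    // Assumes keys are non-null (true for the int counts used here).
    public static TSource MaxByKey<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> selector)
        where TKey : IComparable<TKey>
    {
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                throw new InvalidOperationException("Sequence contains no elements");

            TSource best = e.Current;
            TKey bestKey = selector(best);
            while (e.MoveNext())
            {
                TKey key = selector(e.Current);
                if (key.CompareTo(bestKey) > 0) // on ties, the earliest element wins
                {
                    best = e.Current;
                    bestKey = key;
                }
            }
            return best;
        }
    }
}
```

The selector runs once per element, so in the one-liner above all of the cost sits in the inner `bookmarks.Count(...)` call, not in the max search itself.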

I'm not sure it's possible to speed this up much, but is there anything I can do? Perhaps parallelize it somehow?

dtb's solution takes under 200 ms to both build the universe and find the biggest element. So simple.

var freq = new FreqDict();
foreach(var bm in bookmarks) {
    freq.Add(bm.Keywords);
}
var biggest2 = freq.MaxBy(kvp => kvp.Value);

FreqDict is just a little class I made built on top of a Dictionary<string,int>.
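The source of FreqDict isn't shown, so the following is only a guess at its shape: a thin wrapper over `Dictionary<string,int>` whose `Add` increments a count for each item it is given. The indexer and the enumeration support are assumptions, added so that the snippet above (including `freq.MaxBy(kvp => kvp.Value)`) would have something to work with.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical reconstruction; the real FreqDict may differ.
public class FreqDict : IEnumerable<KeyValuePair<string, int>>
{
    private readonly Dictionary<string, int> counts = new Dictionary<string, int>();

    // Increment the count of every item in the given collection.
    public void Add(IEnumerable<string> items)
    {
        foreach (var item in items)
        {
            int current;
            counts.TryGetValue(item, out current); // current is 0 when the item is new
            counts[item] = current + 1;
        }
    }

    // Count for a key, or 0 if the key was never added.
    public int this[string key]
    {
        get
        {
            int value;
            return counts.TryGetValue(key, out value) ? value : 0;
        }
    }

    public IEnumerator<KeyValuePair<string, int>> GetEnumerator()
    {
        return counts.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

Because each bookmark's keywords are a set, a keyword is counted at most once per bookmark, so the final count for a keyword is the number of bookmarks it appears in.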

Recommended answer

I don't have your sample data, nor have I done any benchmarking, but I'll take a stab. One thing that could be improved is that most of the bm.Keywords.Contains(kw) checks are misses, and I think those can be avoided. The most constraining factor is the set of keywords any given bookmark has (i.e., it will typically be much smaller than universe), so we should iterate in that direction rather than the other way around.
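To put rough numbers on the misses, using only the figures from the question and assuming the universe is exactly the union of all bookmark keyword sets:

```latex
\[
1356 \times 187{,}358 = 254{,}057{,}448 \ \text{checks},
\qquad
\frac{698{,}539}{254{,}057{,}448} \approx 0.0027
\]
```

Each hit corresponds to one (bookmark, keyword) pair, and there are 698,539 of those in total, so roughly 99.7% of the Contains calls return false. Iterating over each bookmark's own keywords touches only those 698,539 pairs.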

I'm thinking something along these lines. The memory requirement is much higher and since I haven't benchmarked anything, it could be slower, or not helpful, but I'll just delete my answer if it doesn't work out for you.

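// Pre-size the dictionary and seed every keyword with a zero count.
// universe is described as a set, so use Count (Length exists only on arrays).
Dictionary<string, int> keywordCounts = new Dictionary<string, int>(universe.Count);
foreach (var keyword in universe)
{
    keywordCounts.Add(keyword, 0);
}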

foreach (var bookmark in bookmarks)
{
    foreach (var keyword in bookmark.Keywords)
    {
        keywordCounts[keyword] += 1;
    }
}

var mostCommonKeyword = keywordCounts.MaxBy(x => x.Value).Key;
