Reduce "for loops for big data" and make improvement


Problem Description

I am trying to make this code (which I wrote) as fast as possible. First, the code is as follows:

# lemmas is a list of about 20,000 words.
# That is, lemmas = ['apple', 'dog', ...]

# new_sents is a list of about 12,000 lists, each representing a sentence.
# That is, new_sents = [['Hello', 'I', 'am', 'a', 'boy'], ['Hello', 'I', 'am', 'a', 'girl'], ...]

for x in lemmas:
    for y in lemmas:
        # prevent zero denominator
        x_count = 0.00001
        y_count = 0.00001

        xy_count = 0
        # Dice denominator
        for i in new_sents:
            x_count += i.count(x)
            y_count += i.count(y)

            if x in i and y in i:
                xy_count += 1

        sim_score = float(xy_count) / (x_count + y_count)

As you can see, there are a great many iterations: about 20,000 * 20,000 * 12,000, which is far too many. sim_score is the Dice coefficient of two words. That is, xy_count is the number of sentences in which word x and word y appear together, and x_count and y_count are the total number of occurrences of x and y in new_sents, respectively.
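As a minimal worked example (on a hypothetical two-sentence corpus), this is what one iteration of the pair loop computes:

# Hypothetical toy corpus
new_sents = [['apple', 'dog'], ['apple', 'cat']]
x, y = 'apple', 'dog'

x_count = 0.00001 + sum(s.count(x) for s in new_sents)     # 2.00001
y_count = 0.00001 + sum(s.count(y) for s in new_sents)     # 1.00001
xy_count = sum(1 for s in new_sents if x in s and y in s)  # 1 co-occurrence

sim_score = float(xy_count) / (x_count + y_count)  # 1 / 3.00002, about 0.333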

My code is too slow. Is there a better way?

Thanks in advance.

Solution

You are computing each thing twice. Your score is symmetric in x and y, so you can get a 2-fold speed-up by doing this:

for x, y in itertools.combinations(lemmas, 2):

I am assuming you don't want to compare lemmas[0] with itself; otherwise you can use combinations_with_replacement.
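A minimal sketch of the loop restructured this way (note the import; the per-pair body is the question's, unchanged):

import itertools

for x, y in itertools.combinations(lemmas, 2):
    x_count = y_count = 0.00001
    xy_count = 0
    for i in new_sents:
        x_count += i.count(x)
        y_count += i.count(y)
        if x in i and y in i:
            xy_count += 1
    # by symmetry, this score is valid for both (x, y) and (y, x)
    sim_score = float(xy_count) / (x_count + y_count)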

The implementation will be faster if you look up lemmas from a set.
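For instance (one reading of this suggestion): membership tests like x in i are O(sentence length) on a list but O(1) on a set, so each sentence can be converted once up front. A sketch; the helper name is mine:

sent_sets = [set(s) for s in new_sents]  # one-time conversion

def cooccurrence(x, y):
    # number of sentences containing both x and y
    return sum(1 for s in sent_sets if x in s and y in s)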

But you are still computing the same thing several times. You can count each lemma in new_sents once and store the result.
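Putting the three suggestions together, a sketch (reusing lemmas and new_sents from the question; the names totals, sent_sets, pair_counts and sim_score are mine):

import itertools
from collections import Counter

# Count every lemma occurrence in the whole corpus once,
# instead of once per (x, y) pair.
totals = Counter()
for sent in new_sents:
    totals.update(sent)

# One-time set conversion for O(1) membership tests.
sent_sets = [set(sent) for sent in new_sents]

# Count co-occurrences sentence by sentence, restricted to the lemmas of interest.
lemma_set = set(lemmas)
pair_counts = Counter()
for s in sent_sets:
    present = sorted(s & lemma_set)  # lemmas that appear in this sentence
    for x, y in itertools.combinations(present, 2):
        pair_counts[x, y] += 1

def sim_score(x, y):
    # pairs were stored with x < y
    if x > y:
        x, y = y, x
    # 0.00002 preserves the two 0.00001 epsilons from the original code
    return pair_counts[x, y] / (totals[x] + totals[y] + 0.00002)

This makes one pass over the 12,000 sentences to build the counts, after which each of the roughly 20,000²/2 scores is a dictionary lookup and a division rather than a full scan of the corpus.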
