Reduce "for loops for big data" and make improvement
I am trying to make this code (that I wrote) as fast as possible. First, the code is as follows:
# lemmas is a list consisting of about 20,000 words.
# That is, lemmas = ['apple', 'dog', ... ]
# new_sents is a list consisting of about 12,000 lists, each representing a sentence.
# That is, new_sents = [ ['Hello', 'I', 'am', 'a', 'boy'], ['Hello', 'I', 'am', 'a', 'girl'], ... ]
for x in lemmas:
    for y in lemmas:
        # prevent zero denominator
        x_count = 0.00001
        y_count = 0.00001
        xy_count = 0

        ## Dice denominator
        for i in new_sents:
            x_count += i.count(x)
            y_count += i.count(y)
            if x in i and y in i:
                xy_count += 1

        sim_score = float(xy_count) / (x_count + y_count)
As you can see, there are a huge number of iterations: about 20,000 * 20,000 * 12,000, which is far too many. sim_score is the Dice coefficient of the two words. That is, xy_count is the number of sentences in which word x and word y appear together, and x_count and y_count are the total numbers of times words x and y appear in new_sents, respectively.
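For example (my own numbers, not from the question): if x appears 2 times and y appears 3 times across all of new_sents, and they co-occur in 1 sentence, then sim_score ≈ 1 / (2 + 3) = 0.2.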
The code I wrote is too slow. Is there a better way?
Thanks in advance.
You are computing everything twice. Your score is symmetric in x and y, so you can get a 2-fold speedup by doing this:
for x, y in itertools.combinations(lemmas, 2):
I am assuming you don't want to compare lemmas[0] with itself; otherwise you can use combinations_with_replacement.
The implementation will be faster if you look up lemmas from a set.
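For instance (a sketch; lemma_set and the comprehension are my own illustration):

lemma_set = set(lemmas)  # O(1) average membership tests, vs O(n) on a list

for sent in new_sents:
    # keep only the words we actually care about
    relevant = [w for w in sent if w in lemma_set]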
But you are still computing the same thing several times. You can take each lemma, count it once in new_sents, and store it.
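A minimal sketch of that idea, assuming the lemmas and new_sents from the question (collections.Counter and all helper names here are my own choices, not from the answer):

from collections import Counter
from itertools import combinations

lemma_set = set(lemmas)

# count every lemma once across all sentences, and record
# the set of lemmas present in each sentence
total_counts = Counter()
sent_lemmas = []
for sent in new_sents:
    counts = Counter(w for w in sent if w in lemma_set)
    total_counts.update(counts)
    sent_lemmas.append(set(counts))

# count co-occurrences sentence by sentence, one unordered pair at a time
pair_counts = Counter()
for present in sent_lemmas:
    for pair in combinations(sorted(present), 2):
        pair_counts[pair] += 1

def sim_score(x, y):
    # 0.00001 per word guards the zero denominator, as in the question
    xy = pair_counts[tuple(sorted((x, y)))]
    return float(xy) / (total_counts[x] + total_counts[y] + 0.00002)

After this precomputation, scoring any pair is a dictionary lookup instead of another pass over all 12,000 sentences.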