Reduce "for loops for big data" and make improvement


Problem Description

I am trying to make this code (which I wrote) as fast as possible. First, the code is as follows:

# lemmas is a list of about 20,000 words.
# That is, lemmas = ['apple', 'dog', ...]

# new_sents is a list of about 12,000 lists, each representing a sentence.
# That is, new_sents = [['Hello', 'I', 'am', 'a', 'boy'], ['Hello', 'I', 'am', 'a', 'girl'], ...]

for x in lemmas:
    for y in lemmas:
        # prevent zero denominator
        x_count = 0.00001
        y_count = 0.00001

        xy_count = 0
        # Dice denominator
        for i in new_sents:
            x_count += i.count(x)
            y_count += i.count(y)

            if x in i and y in i:
                xy_count += 1

        sim_score = float(xy_count) / (x_count + y_count)

As you can see, there are a great many iterations: about 20,000 * 20,000 * 12,000, which is far too many. sim_score is the Dice coefficient of two words. That is, xy_count is the number of sentences in which word x and word y appear together, and x_count and y_count are the total number of occurrences of x and y in new_sents, respectively.
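As a minimal worked example (on a hypothetical two-sentence corpus), this is what one iteration of the pair loop computes:

# Hypothetical toy corpus
new_sents = [['apple', 'dog'], ['apple', 'cat']]
x, y = 'apple', 'dog'

x_count = 0.00001 + sum(s.count(x) for s in new_sents)     # 2.00001
y_count = 0.00001 + sum(s.count(y) for s in new_sents)     # 1.00001
xy_count = sum(1 for s in new_sents if x in s and y in s)  # 1 co-occurrence

sim_score = float(xy_count) / (x_count + y_count)  # 1 / 3.00002, about 0.333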

My code is too slow. Is there a better way?

Thanks in advance.

Solution

You are computing each thing twice. Your score is symmetric in x and y, so you can get a 2-fold speed-up by doing this:

for x, y in itertools.combinations(lemmas, 2):

I am assuming you don't want to compare lemmas[0] with itself; otherwise you can use combinations_with_replacement.
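A minimal sketch of the loop restructured this way (note the import; the per-pair body is the question's, unchanged):

import itertools

for x, y in itertools.combinations(lemmas, 2):
    x_count = y_count = 0.00001
    xy_count = 0
    for i in new_sents:
        x_count += i.count(x)
        y_count += i.count(y)
        if x in i and y in i:
            xy_count += 1
    # by symmetry, this score is valid for both (x, y) and (y, x)
    sim_score = float(xy_count) / (x_count + y_count)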

The implementation will be faster if you look up lemmas from a set.
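For instance (one reading of this suggestion): membership tests like x in i are O(sentence length) on a list but O(1) on a set, so each sentence can be converted once up front. A sketch; the helper name is mine:

sent_sets = [set(s) for s in new_sents]  # one-time conversion

def cooccurrence(x, y):
    # number of sentences containing both x and y
    return sum(1 for s in sent_sets if x in s and y in s)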

But you are still computing the same thing several times. You can count each lemma in new_sents once and store the result.
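Putting the three suggestions together, a sketch (reusing lemmas and new_sents from the question; the names totals, sent_sets, pair_counts and sim_score are mine):

import itertools
from collections import Counter

# Count every lemma occurrence in the whole corpus once,
# instead of once per (x, y) pair.
totals = Counter()
for sent in new_sents:
    totals.update(sent)

# One-time set conversion for O(1) membership tests.
sent_sets = [set(sent) for sent in new_sents]

# Count co-occurrences sentence by sentence, restricted to the lemmas of interest.
lemma_set = set(lemmas)
pair_counts = Counter()
for s in sent_sets:
    present = sorted(s & lemma_set)  # lemmas that appear in this sentence
    for x, y in itertools.combinations(present, 2):
        pair_counts[x, y] += 1

def sim_score(x, y):
    # pairs were stored with x < y
    if x > y:
        x, y = y, x
    # 0.00002 preserves the two 0.00001 epsilons from the original code
    return pair_counts[x, y] / (totals[x] + totals[y] + 0.00002)

This makes one pass over the 12,000 sentences to build the counts, after which each of the roughly 20,000²/2 scores is a dictionary lookup and a division rather than a full scan of the corpus.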
