python中的快速/优化N-gram实现 [英] Fast/Optimize N-gram implementations in python

查看:38
本文介绍了python中的快速/优化N-gram实现的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

python 中哪个 ngram 实现最快?

Which ngram implementation is fastest in python?

我试图分析 nltk 与 scott 的 zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):

I've tried to profile nltk's vs scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):

from nltk.util import ngrams as nltkngram
import this, time

def zipngram(text,n=2):
  return zip(*[text.split()[i:] for i in range(n)])

text = this.s

start = time.time()
nltkngram(text.split(), n=2)
print time.time() - start

start = time.time()
zipngram(text, n=2)
print time.time() - start

[出]

0.000213146209717
6.50882720947e-05

在python中生成ngrams有没有更快的实现?

Is there any faster implementation for generating ngrams in python?

推荐答案

一些分析的尝试.我认为使用发电机可以提高这里的速度.但是与对原始版本的轻微修改相比,改进并不明显.但是如果你不是同时需要完整的列表,生成器函数应该会更快.

Some attempts with some profiling. I thought using generators could improve the speed here. But the improvement was not noticeable compared to a slight modification of the original. But if you don't need the full list at the same time, the generator functions should be faster.

import timeit
from itertools import tee, izip, islice

def isplit(source, sep):
    sepsize = len(sep)
    start = 0
    while True:
        idx = source.find(sep, start)
        if idx == -1:
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + sepsize

def pairwise(iterable, n=2):
    return izip(*(islice(it, pos, None) for pos, it in enumerate(tee(iterable, n))))

def zipngram(text, n=2):
    return zip(*[text.split()[i:] for i in range(n)])

def zipngram2(text, n=2):
    words = text.split()
    return pairwise(words, n)


def zipngram3(text, n=2):
    words = text.split()
    return zip(*[words[i:] for i in range(n)])

def zipngram4(text, n=2):
    words = isplit(text, ' ')
    return pairwise(words, n)


s = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
s = s * 10 ** 3

res = []
for n in range(15):

    a = timeit.timeit('zipngram(s, n)', 'from __main__ import zipngram, s, n', number=100)
    b = timeit.timeit('list(zipngram2(s, n))', 'from __main__ import zipngram2, s, n', number=100)
    c = timeit.timeit('zipngram3(s, n)', 'from __main__ import zipngram3, s, n', number=100)
    d = timeit.timeit('list(zipngram4(s, n))', 'from __main__ import zipngram4, s, n', number=100)

    res.append((a, b, c, d))

a, b, c, d = zip(*res)

import matplotlib.pyplot as plt

plt.plot(a, label="zipngram")
plt.plot(b, label="zipngram2")
plt.plot(c, label="zipngram3")
plt.plot(d, label="zipngram4")
plt.legend(loc=0)
plt.show()

对于这个测试数据,zipngram2 和 zipngram3 似乎是最快的.

For this test data, zipngram2 and zipngram3 seems to be the fastest by a good margin.

这篇关于python中的快速/优化N-gram实现的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆