Iterating through a string one word at a time in Python

Question

I have a string buffer holding the contents of a huge text file, and I need to search it for a given set of words and phrases. What is an efficient way to do this?

I tried matching with the re module, but because the text corpus I have to search is huge, this takes a long time.

I am given a dictionary of words and phrases.

I iterate through each file, read it into a string, search for every word and phrase in the dictionary, and increment the count in the dictionary whenever a key is found.

One small optimization we thought of was to sort the dictionary of phrases and words by word count, from most words to fewest. Then, at each word start position in the string buffer, we compare the following words against that sorted list. Once a phrase is found, we don't search for the other phrases at that position, because the longest phrase has already matched, which is what we want.

Can someone suggest how to go through the string buffer word by word (that is, iterate over the string buffer one word at a time)?

Also, is there any other optimization that can be done here?

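data = str(file_content)  # contents of the current file as one string
for phrase in dictionary_entity:
    # str.count returns 0 (never -1) when the phrase is absent,
    # so the result can be added unconditionally.
    # Note: this counts substrings, so "cat " also matches inside "bobcat ".
    dictionary_entity[phrase] += data.count(phrase + " ")
f.close()  # f is the file object opened earlier (not shown)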

Recommended answer

Here are three different ways to iterate word by word through the contents of a file (The Wizard of Oz from Project Gutenberg, in my case):

import time
import re
from io import StringIO

def word_iter_std(filename):
    # Read the file line by line and split each line on whitespace.
    start = time.time()
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word
    print('iter_std took %0.6f seconds' % (time.time() - start))

def word_iter_re(filename):
    # Read the whole file, then scan it with a regular expression.
    start = time.time()
    with open(filename) as f:
        txt = f.read()
    for match in re.finditer(r'\w+', txt):
        yield match.group()  # yield the matched text, not the match object
    print('iter_re took %0.6f seconds' % (time.time() - start))

def word_iter_stringio(filename):
    # Read the whole file into an in-memory buffer, then iterate over its lines.
    start = time.time()
    with open(filename) as f:
        buf = StringIO(f.read())
    for line in buf:
        for word in line.split():
            yield word
    print('iter_io took %0.6f seconds' % (time.time() - start))

woo = '/tmp/woo.txt'

# Exhaust each generator so that its timing line is printed.
for word in word_iter_std(woo): pass
for word in word_iter_re(woo): pass
for word in word_iter_stringio(woo): pass

Results:

% python /tmp/junk.py
iter_std took 0.016321 seconds
iter_re took 0.028345 seconds
iter_io took 0.016230 seconds
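
Once the text is available word by word, the longest-phrase-first counting described in the question could be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name count_phrases and its helper variables are hypothetical, and it assumes that case-insensitive matching, \w+ tokenization, and skipping past a phrase once it has matched are all acceptable.

```python
import re
from collections import Counter

def count_phrases(text, phrases):
    """Count phrases in text, preferring the longest match at each word."""
    # Assumption: words are runs of \w characters, compared case-insensitively.
    words = re.findall(r"\w+", text.lower())
    # Store each phrase as a tuple of words so lookups are set membership tests.
    phrase_set = {tuple(re.findall(r"\w+", p.lower())) for p in phrases}
    phrase_set.discard(())  # ignore phrases that contain no words
    max_len = max((len(p) for p in phrase_set), default=0)

    counts = Counter()
    i = 0
    while i < len(words):
        # Try the longest possible window first, then shrink it.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if candidate in phrase_set:
                counts[" ".join(candidate)] += 1
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts at this word
    return counts

if __name__ == "__main__":
    sample = "New York is big. New York City is bigger than York."
    print(count_phrases(sample, ["new york", "new york city", "york"]))
```

Looking up tuples of words in a set means the work at each position depends on the length of the longest phrase rather than on the number of phrases, which may help when the dictionary is large. Whether it beats a single compiled regular expression on a particular corpus would need to be measured.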
