Large dictionary creation takes a LOT of time.


Problem Description



This code here:

def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. I
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). I'm trying to figure out how to speed it up.
Is there a better way for me to do a frequency count of all the words in the text?
It seems to me like this should scale linearly, but perhaps it isn't?
I don't know much about algorithmic complexity. If someone could give
a breakdown of this function's complexity as well I'd be much obliged.

lines is expected to be a list of lines as provided by file.readlines().
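
For concreteness, a minimal driver for the function above might look like the sketch below (the file name is an assumption; any plain-text file works):

f = open('DonQuixote.txt')   # assumed path to the input text
lines = f.readlines()        # readlines() returns the whole file as a list of lines
f.close()

counts = wordcount(lines)
print len(counts), 'distinct words'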

Solution

"possibilitybox" <po************@gmail.com> writes:

This code here:

def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. I
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). I'm trying to figure out how to speed it up.
Is there a better way for me to do a frequency count of all the words in the text?
2MB is not that large. Your method is OK and shouldn't be that slow
unless you're on a pretty slow PC. Could your machine be short of
memory and paging a lot? You could tweak the code somewhat by moving
the initialization of the frequency dict out of the loop and combining
a few other statements. Also, you should use xrange instead of range
to avoid allocating a big list in memory:

def wordcount(lines):
    frequency = {}
    for i in xrange(len(lines)/8):
        for word in lines[i].split():
            frequency[word] = 1 + frequency.get(word, 0)
    return frequency

wordcount(lines)
It seems to me like this should scale linearly, but perhaps it isn't?
I don't know much about algorithmic complexity. If someone could give
a breakdown of this function's complexity as well I'd be much obliged.



It should be close to linear, or at worst n log n, depending on what
happens when dicts have to be enlarged as the # of elements increases.
Why are you only processing 1/8th of the lines?
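
The near-linear claim is easy to check empirically. The sketch below is not from the thread; it assumes a local DonQuixote.txt and uses a corrected counter (all lines, dict initialized once, get() instead of the if/else). If the timings roughly double as the slice doubles, the scaling is linear:

import time

def wordcount(lines):
    # count every word on every line; dict is created once, outside the loop
    frequency = {}
    for line in lines:
        for word in line.split():
            frequency[word] = frequency.get(word, 0) + 1
    return frequency

lines = open('DonQuixote.txt').readlines()   # assumed path
for fraction in (8, 4, 2, 1):
    chunk = lines[:len(lines) / fraction]    # 1/8, 1/4, 1/2, then all of the lines
    start = time.time()
    wordcount(chunk)
    print len(chunk), 'lines:', round(time.time() - start, 3), 'seconds'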


possibilitybox wrote:

This code here:

def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. I
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). I'm trying to figure out how to speed it up.
Is there a better way for me to do a frequency count of all the words in the text?
It seems to me like this should scale linearly, but perhaps it isn't?
I don't know much about algorithmic complexity. If someone could give
a breakdown of this function's complexity as well I'd be much obliged.

lines is expected to be a list of lines as provided by file.readlines().



Here is a little cleaner version. It takes about a second to run on my PC. What hardware are you
running on?

path = 'DonQuixote.txt'

frequency = {}

for line in open(path):
    for word in line.split():
        if frequency.has_key(word):
            frequency[word] += 1
        else:
            frequency[word] = 1

print len(frequency), 'words'
Kent


>>>>> "Kent" == Kent Johnson <ke****@tds.net> writes:

Kent> if frequency.has_key(word):
Kent> frequency[word] += 1
Kent> else:
Kent> frequency[word] = 1

This is a good place to use the 'get' method of dict:

frequency[word] = frequency.get(word,0) + 1

--
Ville Vainio http://tinyurl.com/2prnb
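
Folding Ville's suggestion into Kent's loop collapses the four-line if/else into a single statement; a sketch using the same file as above:

frequency = {}
for line in open('DonQuixote.txt'):
    for word in line.split():
        frequency[word] = frequency.get(word, 0) + 1

print len(frequency), 'words'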

