large dictionary creation takes a LOT of time.

Problem description
this code here:
def wordcount(lines):
for i in range(len(lines)/8):
words = lines[i].split(" ")
if not locals().has_key("frequency"):
frequency = {}
for word in words:
if frequency.has_key(word):
frequency[word] += 1
else:
frequency[word] = 1
return frequency
wordcount(lines)
is taking over six minutes to run on a two megabyte text file. i
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). i'm trying to figure out why. is there a
better way for me to do a frequency count of all the words in the text?
it seems to me like this should scale linearly, but perhaps it isn't?
i don't know much about algorithmic complexity. if someone could give
a breakdown of this function's complexity as well i'd be much obliged.
lines is expected to be a list of lines as provided by file.readline()
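(As an aside: a list of lines like the one described is what file.readlines(), plural, returns; readline() yields a single line per call. A minimal sketch, written for current Python 3 and using an in-memory file object so it is self-contained:)

```python
import io

# Simulate an open file; readlines() (note the plural) returns the
# whole file as a list of lines, each keeping its trailing newline.
buf = io.StringIO("first line\nsecond line\n")
lines = buf.readlines()
print(lines)   # ['first line\n', 'second line\n']
```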
Solution

"possibilitybox" <po************@gmail.com> writes:
this code here:
def wordcount(lines):
for i in range(len(lines)/8):
words = lines[i].split(" ")
if not locals().has_key("frequency"):
frequency = {}
for word in words:
if frequency.has_key(word):
frequency[word] += 1
else:
frequency[word] = 1
return frequency
wordcount(lines)
is taking over six minutes to run on a two megabyte text file. i
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). i'm trying to figure out why. is there a
better way for me to do a frequency count of all the words in the text?
2MB is not that large. Your method is ok and shouldn't be that slow
unless you're on a pretty slow PC. Could your machine be short of
memory and paging a lot? You could tweak the code somewhat by moving
the initialization of the frequency dict out of the loop and combining
a few other statements. Also you should use xrange instead of range,
to avoid allocating a big list in memory:
def wordcount(lines):
frequency = {}
for i in xrange(len(lines)/8):
for word in lines[i].split():
frequency[word] = 1 + frequency.get(word, 0)
return frequency
wordcount(lines)
it seems to me like this should scale linearly, but perhaps it isn't?
i don't know much about algorithmic complexity. if someone could give
a breakdown of this function's complexity as well i'd be much obliged.
It should be close to linear, or at worst n log n, depending on what
happens when dicts have to be enlarged as the # of elements increases.
Why are you only processing 1/8th of the lines?
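Putting those suggestions together and dropping the 1/8th slicing, the fix can be checked on a tiny made-up sample (not from the thread). This sketch is written for current Python 3, where range is already lazy; it uses bare split(), which also collapses runs of whitespace that split(" ") would turn into empty-string "words":

```python
def wordcount(lines):
    # Build the dict once, before the loop, and count every line.
    frequency = {}
    for line in lines:
        for word in line.split():  # bare split() handles tabs/newlines too
            frequency[word] = 1 + frequency.get(word, 0)
    return frequency

counts = wordcount(["the quick brown  fox\n", "the lazy dog\n"])
print(counts["the"])   # 2
```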
possibilitybox wrote:this code here:
def wordcount(lines):
for i in range(len(lines)/8):
words = lines[i].split(" ")
if not locals().has_key("frequency"):
frequency = {}
for word in words:
if frequency.has_key(word):
frequency[word] += 1
else:
frequency[word] = 1
return frequency
wordcount(lines)
is taking over six minutes to run on a two megabyte text file. i
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). i'm trying to figure out why. is there a
better way for me to do a frequency count of all the words in the text?
it seems to me like this should scale linearly, but perhaps it isn't?
i don't know much about algorithmic complexity. if someone could give
a breakdown of this function's complexity as well i'd be much obliged.
lines is expected to be a list of lines as provided by file.readline()
Here is a little cleaner version. It takes about a second to run on my PC. What hardware are you
running on?
path = 'DonQuixote.txt'
frequency = {}
for line in open(path):
for word in line.split():
if frequency.has_key(word):
frequency[word] += 1
else:
frequency[word] = 1
print len(frequency), 'words'
Kent
>>>>> "Kent" == Kent Johnson <ke****@tds.net> writes:
Kent> if frequency.has_key(word):
Kent> frequency[word] += 1
Kent> else:
Kent> frequency[word] = 1
This is a good place to use the 'get' method of dict:
frequency[word] = frequency.get(word, 0) + 1
--
Ville Vainio http://tinyurl.com/2prnb
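For completeness: the get-with-default accumulation that Ville shows was later built into the standard library as collections.Counter (added in Python 2.7). A minimal sketch in current Python 3:

```python
from collections import Counter

def wordcount(lines):
    # Counter.update adds each word's occurrences to the running totals.
    frequency = Counter()
    for line in lines:
        frequency.update(line.split())
    return frequency

counts = wordcount(["the quick brown fox\n", "the lazy dog\n"])
print(counts.most_common(1))   # [('the', 2)]
```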