Large dictionary creation takes a LOT of time.


Problem Description



This code here:

def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. I
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). I'm trying to figure out how to speed it up.
Is there a better way for me to do a frequency count of all the words in the text?
It seems to me like this should scale linearly, but perhaps it isn't?
I don't know much about algorithmic complexity. If someone could give
a breakdown of this function's complexity as well I'd be much obliged.

lines is expected to be a list of lines as provided by file.readlines().
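
For concreteness, a minimal driver for the function above might look like the sketch below (the file name is an assumption; any plain-text file works):

f = open('DonQuixote.txt')   # assumed path to the input text
lines = f.readlines()        # readlines() returns the whole file as a list of lines
f.close()

counts = wordcount(lines)
print len(counts), 'distinct words'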

Solution

"possibilitybox" <po************@gmail.com> writes:

This code here:

def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. I
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). I'm trying to figure out how to speed it up.
Is there a better way for me to do a frequency count of all the words in the text?
2MB is not that large. Your method is OK and shouldn't be that slow
unless you're on a pretty slow PC. Could your machine be short of
memory and paging a lot? You could tweak the code somewhat by moving
the initialization of the frequency dict out of the loop and combining
a few other statements. Also, you should use xrange instead of range
to avoid allocating a big list in memory:

def wordcount(lines):
    frequency = {}
    for i in xrange(len(lines)/8):
        for word in lines[i].split():
            frequency[word] = 1 + frequency.get(word, 0)
    return frequency

wordcount(lines)
It seems to me like this should scale linearly, but perhaps it isn't?
I don't know much about algorithmic complexity. If someone could give
a breakdown of this function's complexity as well I'd be much obliged.



It should be close to linear, or at worst n log n, depending on what
happens when dicts have to be enlarged as the # of elements increases.
Why are you only processing 1/8th of the lines?
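
The near-linear claim is easy to check empirically. The sketch below is not from the thread; it assumes a local DonQuixote.txt and uses a corrected counter (all lines, dict initialized once, get() instead of the if/else). If the timings roughly double as the slice doubles, the scaling is linear:

import time

def wordcount(lines):
    # count every word on every line; dict is created once, outside the loop
    frequency = {}
    for line in lines:
        for word in line.split():
            frequency[word] = frequency.get(word, 0) + 1
    return frequency

lines = open('DonQuixote.txt').readlines()   # assumed path
for fraction in (8, 4, 2, 1):
    chunk = lines[:len(lines) / fraction]    # 1/8, 1/4, 1/2, then all of the lines
    start = time.time()
    wordcount(chunk)
    print len(chunk), 'lines:', round(time.time() - start, 3), 'seconds'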


possibilitybox wrote:

This code here:

def wordcount(lines):
    for i in range(len(lines)/8):
        words = lines[i].split(" ")
        if not locals().has_key("frequency"):
            frequency = {}
        for word in words:
            if frequency.has_key(word):
                frequency[word] += 1
            else:
                frequency[word] = 1
    return frequency

wordcount(lines)

is taking over six minutes to run on a two megabyte text file. I
realize that's a big text file, a really big one (it's actually the
full text of Don Quixote). I'm trying to figure out how to speed it up.
Is there a better way for me to do a frequency count of all the words in the text?
It seems to me like this should scale linearly, but perhaps it isn't?
I don't know much about algorithmic complexity. If someone could give
a breakdown of this function's complexity as well I'd be much obliged.

lines is expected to be a list of lines as provided by file.readlines().



Here is a little cleaner version. It takes about a second to run on my PC. What hardware are you
running on?

path = 'DonQuixote.txt'

frequency = {}

for line in open(path):
    for word in line.split():
        if frequency.has_key(word):
            frequency[word] += 1
        else:
            frequency[word] = 1

print len(frequency), 'words'
Kent


>>>>> "Kent" == Kent Johnson <ke****@tds.net> writes:

Kent> if frequency.has_key(word):
Kent> frequency[word] += 1
Kent> else:
Kent> frequency[word] = 1

This is a good place to use the 'get' method of dict:

frequency[word] = frequency.get(word,0) + 1

--
Ville Vainio http://tinyurl.com/2prnb
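
Folding Ville's suggestion into Kent's loop collapses the four-line if/else into a single statement; a sketch using the same file as above:

frequency = {}
for line in open('DonQuixote.txt'):
    for word in line.split():
        frequency[word] = frequency.get(word, 0) + 1

print len(frequency), 'words'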

