Loading large file (25k entries) into dict is slow in Python?

Question

I have a file of about 25,000 lines in s19 format.

Each line looks like this: S214 780010 00802000000010000000000A508CC78C 7A

There are no spaces in the actual file. The first part, 780010, is the address of the line, and I want it to be a key in a dict. I want the data part, 00802000000010000000000A508CC78C, to be the value for that key. I wrote my code like this:

def __init__(self,filename):
    infile = file(filename,'r')
    self.all_lines = infile.readlines()
    self.dict_by_address = {}

    for i in range(0, self.get_line_number()):
        self.dict_by_address[self.get_address_of_line(i)] = self.get_data_of_line(i)

    infile.close()

get_address_of_line() and get_data_of_line() are both simple string-slicing functions. get_line_number() iterates over self.all_lines and returns an int.

The problem is that the init process takes over a minute. Is the way I construct the dict wrong, or does Python just need this long to do it?

By the way, I'm new to Python :) Maybe the code looks more like C/C++. Any advice on how to write more Pythonic code is appreciated :)

Recommended answer

This code should be tremendously faster than what you have now. EDIT: As @sth pointed out, this doesn't work because there are no spaces in the actual file. I'll add a corrected version at the end.

def __init__(self,filename):
    self.dict_by_address = {}

    with open(filename, 'r') as infile:
        for line in infile:
            _, key, value, _ = line.split()
            self.dict_by_address[key] = value

Some comments:


  • Best practice in Python is to use a with statement, unless you are using an old Python that doesn't have it.

  • Best practice is to use open() rather than file(); Python 3.x does not have file() at all.

  • You can use the open file object as an iterator: each iteration gives you one line of the input. This is better than calling the .readlines() method, which slurps all the data into a list that you then use once and throw away. With a large enough input file, that can cause swapping to virtual memory, which is always slow. This version avoids building and deleting the giant list.

  • Then, having created a giant list of input lines, you use range() to make a big list of integers. Again, it wastes time and memory to build a list, use it once, and then delete it. In Python 2 you can avoid this overhead by using xrange() (in Python 3, range() is already lazy), but even better is to build the dictionary as you go, as part of the same loop that reads lines from the file.

  • It might be better to use your special slicing functions to pull out the "address" and "data" fields, but if the input is regular (always follows the pattern of your example) you can just do what I showed here. line.split() splits the line on whitespace, giving a list of four strings. Then we unpack it into four variables using "destructuring assignment" (tuple unpacking). Since we only want to save two of the values, I used the variable name _ (a single underscore) for the other two. That's not really a language feature, but it is an idiom in the Python community: when you have data you don't care about, you can assign it to _. This line will raise an exception if there are ever any number of values other than 4, so if blank lines, comment lines, or anything similar can occur, you should add checks and handle the error (at least wrap that line in a try/except).
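
The unpacking idiom and the error handling described in the last comment can be sketched as follows. This is only an illustration, assuming whitespace-separated input like the example line (which the real file does not have); the function name parse_spaced_lines is made up for this example.

```python
def parse_spaced_lines(lines):
    """Build {address: data} from lines shaped like 'S214 780010 <data> 7A'.

    Hypothetical helper: assumes four whitespace-separated fields per line.
    """
    dict_by_address = {}
    for line in lines:
        try:
            # Unpack the four fields; `_` marks the ones we don't need.
            _, key, value, _ = line.split()
        except ValueError:
            # Blank lines, comments, or anything without exactly 4 fields.
            continue
        dict_by_address[key] = value
    return dict_by_address


sample = [
    "S214 780010 00802000000010000000000A508CC78C 7A\n",
    "\n",
    "# comment\n",
]
print(parse_spaced_lines(sample))
```

Silently skipping bad lines is one possible policy; logging or re-raising with the line number may be preferable if malformed input should not go unnoticed.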

EDIT: corrected version:

def __init__(self,filename):
    self.dict_by_address = {}

    with open(filename, 'r') as infile:
        for line in infile:
            key = extract_address(line) 
            value = extract_data(line)
            self.dict_by_address[key] = value
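
The corrected version leaves extract_address() and extract_data() undefined. Here is a minimal sketch of what they might look like, assuming the standard Motorola S-record layout: a 2-character record type, a 2-character byte count, an address of 4, 6, or 8 hex digits for S1, S2, or S3 records, the data, and a 2-character checksum. The ADDRESS_DIGITS table and load_records() are illustrative names, not part of the original answer.

```python
import io

# Hex digits used by the address field of each data record type (assumption:
# standard S-record layout; other record types such as S0 or S9 are skipped).
ADDRESS_DIGITS = {"S1": 4, "S2": 6, "S3": 8}


def extract_address(line):
    """Return the address field, e.g. '780010' for an S2 record."""
    width = ADDRESS_DIGITS[line[:2]]
    return line[4:4 + width]


def extract_data(line):
    """Return the data field: everything between address and checksum."""
    width = ADDRESS_DIGITS[line[:2]]
    return line.rstrip()[4 + width:-2]


def load_records(infile):
    """Build {address: data} from any iterable of S-record lines."""
    dict_by_address = {}
    for line in infile:
        if line[:2] not in ADDRESS_DIGITS:
            continue  # header, terminator, or blank line
        dict_by_address[extract_address(line)] = extract_data(line)
    return dict_by_address


# io.StringIO stands in for an open file so the example is self-contained.
sample = io.StringIO(
    "S214" "780010" "00802000000010000000000A508CC78C" "7A" "\n"
)
records = load_records(sample)
print(records)
```

With a real file, passing the object from open(filename) inside a with block to load_records() would play the same role as the loop in __init__ above.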
