Reading Huge File in Python
Question description
I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.
My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.
# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
    line = map(int, line.split())
    while line[0] == list[pointer].key:
        list[pointer].value = line[1]
        pointer += 1
    while line[0] > list[pointer].key:
        pointer += 1
        if pointer >= len(list) - 1:
            break  # end of list; -1 is due to sentinel
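The loop above can be written as a self-contained, runnable sketch. This assumes the search keys are plain sorted integers in a list (rather than objects with .key/.value attributes, as in the original) and that results are collected into a dict; the function name merge_scan is illustrative:

```python
def merge_scan(fin, search_keys):
    """Walk the sorted key/value file once, advancing a pointer into the
    sorted search_keys list; return a {key: value} dict of the keys found."""
    results = {}
    pointer = 0
    for line in fin:
        key, value = map(int, line.split())
        # search keys smaller than the current file key cannot appear later,
        # so they are missing from the file; skip past them
        while pointer < len(search_keys) and search_keys[pointer] < key:
            pointer += 1
        # record the value for every search key matching this line's key
        while pointer < len(search_keys) and search_keys[pointer] == key:
            results[search_keys[pointer]] = value
            pointer += 1
        if pointer >= len(search_keys):
            break  # all search keys resolved; no need to read further
    return results
```

Because both the file and the key list are sorted, this makes a single forward pass over the file, which is why the original approach is I/O-bound rather than CPU-bound.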
Coded binary search + seek solution (thanks kigurai!):
entries = 24935502  # number of entries
width = 18          # fixed width of an entry in the file, padded with spaces
                    # at the end of each line
for i, search in enumerate(list):  # list contains the list of search keys
    left, right = 0, entries - 1
    key = None
    while key != search and left <= right:
        mid = (left + right) / 2
        fin.seek(mid * width)
        key, value = map(int, fin.readline().split())
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    if key != search:
        value = None  # for when search key is not found
    search.result = value  # store the result of the search
If you only need 200 of 50 million lines, then reading all of them into memory is a waste. I would sort the list of search keys and then apply binary search to the file using seek() or something similar. This way you would not read the entire file into memory, which should speed things up.
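The suggestion above can be sketched roughly as follows. This assumes every line in the file is padded to the same fixed width (newline included), so that record i starts at offset i * width and seek() can jump straight to it; the function name and parameters are illustrative:

```python
def lookup(fin, search, entries, width):
    """Binary-search a sorted fixed-width key/value file for one key;
    return the value, or None if the key is absent."""
    left, right = 0, entries - 1
    while left <= right:
        mid = (left + right) // 2   # floor division works on Python 2 and 3
        fin.seek(mid * width)       # jump directly to record `mid`
        key, value = map(int, fin.readline().split())
        if key == search:
            return value
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    return None
```

Each lookup touches at most about log2(entries) records (roughly 25 seeks for 25 million entries), so 200 lookups read only a few thousand short lines instead of the whole 384MB file.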