Reading Huge File in Python


Problem Description



I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.

My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.

# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
pointer = 0
for line in fin:
  line = map(int, line.split())
  while line[0] == list[pointer].key:
    list[pointer].value = line[1]
    pointer += 1
  while line[0] > list[pointer].key:
    pointer += 1
  if pointer >= len(list) - 1:
    break # end of list; -1 is due to sentinel
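For clarity, the merge-style scan above can be sketched as a self-contained function. This is a minimal sketch, not the asker's exact code: it returns a plain dict instead of storing results on key objects, and it accepts any iterable of lines (a file object or `io.StringIO`).

```python
def lookup_linear(fin, keys):
    """Single pass over a key-sorted file of 'key value' lines,
    merged against a sorted list of search keys."""
    keys = sorted(keys)
    results = {}
    pointer = 0
    for line in fin:
        k, v = map(int, line.split())
        # search keys smaller than the current file key are absent from the file
        while pointer < len(keys) and keys[pointer] < k:
            pointer += 1
        # record a match for the current file key
        while pointer < len(keys) and keys[pointer] == k:
            results[keys[pointer]] = v
            pointer += 1
        if pointer >= len(keys):
            break  # all search keys resolved
    return results
```

Sorting the search keys up front is what makes the single merged pass possible; the loop never moves backwards in either the file or the key list.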

Coded binary search + seek solution (thanks kigurai!):

entries = 24935502 # number of entries
width   = 18       # fixed width of an entry in the file padded with spaces
                   # at the end of each line
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, entries-1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid * width)
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search
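The fixed-width seek trick above only works if the file was written with every line padded to the same byte width, so that line i starts at byte offset i * width. A sketch of producing such a file (the helper name and interface are illustrative, not from the question; the default width of 18 matches the constant above):

```python
def write_fixed_width(pairs, out, width=18):
    """Write (key, value) pairs as space-padded lines so that every
    line occupies exactly `width` bytes, newline included."""
    for key, value in pairs:
        record = "%d %d" % (key, value)
        assert len(record) < width, "record too long for fixed width"
        out.write(record.ljust(width - 1) + "\n")
```

With this layout, `fin.seek(mid * width)` always lands exactly at the start of line `mid`, which is the invariant the binary search relies on.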

Solution

If you only need 200 of 50 million lines, then reading all of it into memory is a waste. I would sort the list of search keys and then apply binary search to the file using seek() or something similar. This way you would not read the entire file into memory, which I think should speed things up.
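The approach the answer describes can be sketched as a small function: binary search over a key-sorted file whose lines are padded to a fixed byte width (a minimal sketch under that fixed-width assumption; the function name and interface are illustrative, and it restructures the asker's loop with an early return):

```python
def lookup_binary(fin, search, entries, width):
    """Binary search a key-sorted file whose lines are padded to a
    fixed byte width, so line i starts at byte offset i * width."""
    left, right = 0, entries - 1
    while left <= right:
        mid = (left + right) // 2
        fin.seek(mid * width)
        key, value = map(int, fin.readline().split())
        if key == search:
            return value
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    return None  # search key not in file
```

Each of the roughly 200 lookups then costs O(log n) seeks instead of a full scan, so only a few dozen short lines are read per key rather than the whole 384MB file.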
