Python: How to read huge text file into memory

Problem Description

I'm using Python 2.6 on a Mac Mini with 1GB RAM. I want to read in a huge text file

$ ls -l links.csv; file links.csv; tail links.csv 
-rw-r--r--  1 user  user  469904280 30 Nov 22:42 links.csv
links.csv: ASCII text, with CRLF line terminators
4757187,59883
4757187,99822
4757187,66546
4757187,638452
4757187,4627959
4757187,312826
4757187,6143
4757187,6141
4757187,3081726
4757187,58197

So each line in the file consists of a tuple of two comma-separated integer values. I want to read in the whole file and sort it according to the second column. I know that I could do the sorting without reading the whole file into memory. But I thought that for a file of 500MB I should still be able to do it in memory, since I have 1GB available.

However, when I try to read in the file, Python seems to allocate a lot more memory than is needed by the file on disk. So even with 1GB of RAM I'm not able to read the 500MB file into memory. My Python code for reading the file and printing some information about the memory consumption is:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

infile = open("links.csv", "r")

edges = []
count = 0
# count the total number of lines in the file
for line in infile:
    count = count + 1

total = count
print "Total number of lines: ", total

infile.seek(0)
count = 0
for line in infile:
    edge = tuple(map(int, line.strip().split(",")))
    edges.append(edge)
    count = count + 1
    # for every million lines, print memory consumption
    if count % 1000000 == 0:
        print "Position: ", edge
        print "Read ", float(count) / float(total) * 100, "%."
        mem = sys.getsizeof(edges)
        for edge in edges:
            mem = mem + sys.getsizeof(edge)
            for node in edge:
                mem = mem + sys.getsizeof(node)

        print "Memory (Bytes): ", mem

The output I get is:

Total number of lines:  30609720
Position:  (9745, 2994)
Read  3.26693612356 %.
Memory (Bytes):  64348736
Position:  (38857, 103574)
Read  6.53387224712 %.
Memory (Bytes):  128816320
Position:  (83609, 63498)
Read  9.80080837067 %.
Memory (Bytes):  192553000
Position:  (139692, 1078610)
Read  13.0677444942 %.
Memory (Bytes):  257873392
Position:  (205067, 153705)
Read  16.3346806178 %.
Memory (Bytes):  320107588
Position:  (283371, 253064)
Read  19.6016167413 %.
Memory (Bytes):  385448716
Position:  (354601, 377328)
Read  22.8685528649 %.
Memory (Bytes):  448629828
Position:  (441109, 3024112)
Read  26.1354889885 %.
Memory (Bytes):  512208580

After reading only 25% of the 500MB file, Python already consumes 500MB. So it seems that storing the content of the file as a list of tuples of ints is not very memory-efficient. Is there a better way to do it, so that I can read my 500MB file into my 1GB of memory?

Recommended Answer

There is a recipe for sorting files larger than RAM on this page, though you'd have to adapt it for your case involving CSV-format data. There are also links to additional resources there.
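
For reference, here is a minimal sketch of what such an external (chunked) merge sort could look like for this two-column CSV. It is not the linked recipe itself; the function name, chunk size and file names are made up for illustration, and it sorts by the second column as the question asks:

import heapq
import itertools
import os
import tempfile

def external_sort_csv(infile='links.csv', outfile='links_sorted.csv', chunk_lines=1000000):
    # Phase 1: read fixed-size chunks, sort each chunk in memory by the
    # second column, and spill each sorted chunk to a temporary file.
    chunk_files = []
    f = open(infile, 'rb')
    while True:
        lines = list(itertools.islice(f, chunk_lines))
        if not lines:
            break
        rows = sorted((tuple(map(int, line.split(','))) for line in lines),
                      key=lambda r: r[1])
        tmp = tempfile.NamedTemporaryFile(delete=False)
        tmp.writelines('%d,%d\n' % row for row in rows)
        tmp.close()
        chunk_files.append(tmp.name)
    f.close()

    # Phase 2: k-way merge of the sorted chunk files. Yielding (b, a)
    # tuples makes heapq.merge order the stream by column b.
    def rows_from(name):
        for line in open(name, 'rb'):
            a, b = map(int, line.split(','))
            yield b, a

    out = open(outfile, 'wb')
    for b, a in heapq.merge(*[rows_from(name) for name in chunk_files]):
        out.write('%d,%d\n' % (a, b))
    out.close()

    for name in chunk_files:
        os.remove(name)

Only chunk_lines parsed rows are held in memory at any time, so peak memory is set by the chunk size rather than by the file size.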

True, the file on disk is not "larger than RAM", but the in-memory representation can easily become much larger than available RAM. For one thing, your own program doesn't get the entire 1GB (OS overhead etc). For another, even if you stored this in the most compact form for pure Python (two lists of integers, assuming 32-bit machine etc), you'd be using 934MB for those 30M pairs of integers.
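
As a rough back-of-the-envelope check of that 934MB figure (assuming roughly 12 bytes per small int object plus a 4-byte list pointer per element on a 32-bit CPython build; the exact numbers vary by build):

rows = 30609720                  # lines in links.csv
ints = rows * 2                  # two integers per row
per_int = 12                     # approx. size of an int object on 32-bit CPython
per_slot = 4                     # pointer stored in the list for each element
print "%.0f MiB" % (ints * (per_int + per_slot) / 1024.0 / 1024.0)   # ~934 MiB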

Using numpy you can also do the job, using only about 250MB. It isn't particularly fast to load this way, as you have to count the lines and pre-allocate the array, but it may be the fastest actual sort given that it's in-memory:

import time
import numpy as np
import csv

start = time.time()
def elapsed():
    return time.time() - start

# count data rows, to preallocate array
f = open('links.csv', 'rb')
def count(f):
    while 1:
        block = f.read(65536)
        if not block:
             break
        yield block.count(',')

linecount = sum(count(f))
print '\n%.3fs: file has %s rows' % (elapsed(), linecount)

# pre-allocate array and load data into array
m = np.zeros(linecount, dtype=[('a', np.uint32), ('b', np.uint32)])
f.seek(0)
f = csv.reader(open('links.csv', 'rb'))
for i, row in enumerate(f):
    m[i] = int(row[0]), int(row[1])

print '%.3fs: loaded' % elapsed()
# sort in-place
m.sort(order='b')

print '%.3fs: sorted' % elapsed()

Output on my machine with a sample file similar to what you showed:

6.139s: file has 33253213 lines
238.130s: read into memory
517.669s: sorted

The default in numpy is Quicksort. The ndarray.sort() routine (which sorts in place) can also take the keyword argument kind="mergesort" or kind="heapsort", but it appears that neither of these is capable of sorting a record array. Incidentally, I used a record array because it was the only way I could see to sort the columns together, as opposed to the default, which would sort them independently (totally messing up your data).
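
To make that last point concrete, here is a small illustration (with a tiny made-up sample, and np.savetxt as just one possible way to write the result back out) of how sorting the structured array on field 'b' keeps each pair intact:

import numpy as np

m = np.array([(4757187, 99822), (4757187, 6143), (4757187, 638452)],
             dtype=[('a', np.uint32), ('b', np.uint32)])

m.sort(order='b')   # whole records are reordered by field 'b'; each 'a' stays with its 'b'
print m             # [(4757187, 6143) (4757187, 99822) (4757187, 638452)]

# one way to write the sorted pairs back to CSV (simple, though slow for 30M rows)
np.savetxt('links_sorted.csv', np.column_stack((m['a'], m['b'])), fmt='%d', delimiter=',')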
