Why does a pandas DataFrame consume much more RAM than the size of the original text file?
Question
I'm trying to import a large tab-separated text file (3 GB) into Python using pandas: pd.read_csv("file.txt", sep="\t"). The file was originally a ".tab" file; I changed its extension to ".txt" to import it with read_csv(). It has 305 columns and roughly 1,000,000 rows.
When I execute the code, Python raises a MemoryError after some time. From what I found, this basically means there is not enough RAM available. When I specify nrows=20 in read_csv(), it works fine.
The computer I'm using has 46 GB of RAM, of which roughly 20 GB was available to Python.
My question: how can a 3 GB file need more than 20 GB of RAM to be imported into Python with pandas read_csv()? Am I doing anything wrong?
EDIT: When executing df.dtypes, the types are a mix of object, float64, and int64.
UPDATE: I used the following code to overcome the problem and perform my calculations:
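import pandas as pd

rows = []  # one {"sample", "read sum"} record per column
x = 0      # starting index is an assumption; the original snippet never set x
while x < 352:
    x = x + 1
    # Load a single column so only one column is in memory at a time
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    name = sample_col.columns[0]
    rows.append({"sample": name, "read sum": sample_col[name].sum()})  # .sum() skips NaN
    del sample_col
summed_cols = pd.DataFrame(rows, columns=["sample", "read sum"])  # DataFrame.append was removed in pandas 2.0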
It now selects a column, performs the calculation, stores the result, deletes the current column, and moves on to the next column.
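As a point of comparison, here is a minimal sketch of a different technique: reading the file in row chunks with the chunksize parameter of read_csv() and accumulating the column sums, so the file is parsed once instead of once per column. This is not the code from the update above. The function name column_sums_chunked, the chunk size, and the demo data (including the "gene" label column) are assumptions made for illustration.

```python
import io

import pandas as pd


def column_sums_chunked(source, chunksize=100_000):
    """Sum every numeric column of a tab-separated file, one chunk of rows at a time."""
    totals = None
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        # Only the current chunk is held in memory; non-numeric columns are skipped
        part = chunk.sum(numeric_only=True)
        totals = part if totals is None else totals.add(part, fill_value=0)
    return totals


# Tiny hypothetical stand-in for the real file: one label column, two sample columns
demo = io.StringIO("gene\ts1\ts2\na\t1\t10\nb\t2\t20\nc\t3\t30\n")
sums = column_sums_chunked(demo, chunksize=2)
print(sums)
```

Peak memory then depends on the chunk size rather than on the total number of rows, at the cost of keeping all 305 columns of each chunk in memory at once.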
Recommended answer
Pandas cuts the file up and stores each value individually. I don't know the data types, so I'll assume the worst: strings.
In Python (on my machine), an empty string needs 49 bytes, plus one byte per character if it is ASCII (or 74 bytes plus 2 bytes per character if it is Unicode). That's roughly 15 KB for a row of 305 empty fields (305 × 49 bytes). A million and a half such rows would take roughly 22 GB in memory, while they would take about 437 MB in a CSV file.
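The arithmetic behind that estimate can be reproduced with a short sketch. The 49-byte figure is the one quoted for the answerer's machine, and sys.getsizeof("") may report a different value on another Python build. The helper name estimate_bytes is an assumption, and the estimate ignores the 8-byte pointer that an object column also stores for each cell.

```python
import sys

N_COLS = 305
N_ROWS = 1_500_000


def estimate_bytes(per_field_bytes, n_cols=N_COLS, n_rows=N_ROWS):
    """Return (bytes per row, total bytes) if every field costs per_field_bytes."""
    per_row = per_field_bytes * n_cols
    return per_row, per_row * n_rows


# 49 bytes per empty string is the figure quoted above, not a universal constant
per_row, total = estimate_bytes(49)

# In the text file, a row of empty fields is 304 tabs plus a newline: 305 bytes
csv_bytes = N_COLS * N_ROWS

print("empty str on this machine:", sys.getsizeof(""), "bytes")
print("per row in memory:", per_row, "bytes")
print("total in memory:", total / 1e9, "GB")
print("same rows as CSV:", csv_bytes / 2**20, "MiB")
```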
Pandas/numpy are good with numbers, as they can represent a numerical series very compactly (like a C program would). As soon as you step away from C-compatible datatypes, it uses memory as Python does, which is... not very frugal.
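A rough way to see the difference is to store the same values once as int64 and once as Python strings in an object column, then compare DataFrame.memory_usage(deep=True). This is a sketch with made-up data; the column names and the row count are assumptions, and the exact size of the string column varies by Python build.

```python
import numpy as np
import pandas as pd

n = 100_000  # illustrative row count

df = pd.DataFrame({
    # Compact: one 8-byte integer per value in a contiguous buffer
    "as_int": np.arange(n, dtype="int64"),
    # Object column: an 8-byte pointer per value plus a separate Python str object
    "as_str": pd.Series([str(i) for i in range(n)], dtype=object),
})

# deep=True also counts the Python objects that the object column points to
usage = df.memory_usage(index=False, deep=True)
print(usage)
print("ratio:", usage["as_str"] / usage["as_int"])
```

Running df.memory_usage(deep=True) on a sample loaded with nrows is one way to check which columns of the real file are object columns and how much they cost.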