Pandas/Python memory spike while reading 3.2 GB file
Question
So I have been trying to read a 3.2 GB file into memory using the pandas read_csv function, but I kept running into some sort of memory leak; my memory usage would spike to 90%+.
As an alternative, I tried defining dtype to avoid keeping the data in memory as strings, but saw similar behaviour.
I tried out numpy's CSV reader, thinking I would get some different results, but was definitely wrong about that.
Reading line by line ran into the same problem, just much more slowly.
I recently moved to Python 3, so I thought there could be some bug there, but saw similar results on Python 2 + pandas.
The file in question is the train.csv file from the Grupo Bimbo Kaggle competition.
System info:
RAM: 16 GB, Processor: i7, 8 cores
Let me know if you would like to know anything else.
Thanks :)
EDIT 1: it's a memory spike! Not a leak (sorry, my bad).
Sample of the csv file:
Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3
Number of lines in the file: 74180465
Aside from a simple pd.read_csv('filename', low_memory=False), I also tried:
from numpy import genfromtxt
my_data = genfromtxt('data/train.csv', delimiter=',')
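One likely reason genfromtxt didn't help: without an explicit dtype it parses every field as float64, so each value costs 8 bytes no matter how small the integers are. A minimal sketch of that behaviour, using a tiny in-memory sample rather than the real train.csv:

```python
import numpy as np
from io import StringIO

# Two short rows for illustration; genfromtxt defaults to dtype=float64
sample = StringIO("3,1110,7\n3,1110,7\n")
arr = np.genfromtxt(sample, delimiter=',')
print(arr.dtype)   # float64
print(arr.nbytes)  # 6 values * 8 bytes = 48
```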
UPDATE: The code below just worked, but I still want to get to the bottom of this problem; there must be something wrong.
import pandas as pd
import gc

# DataFrame.append returns a new frame rather than modifying in place,
# so its result must be kept; collecting the chunks in a list and
# concatenating once avoids repeatedly copying the accumulated data.
chunks = []
data_iterator = pd.read_csv('data/train.csv', chunksize=100000)
for sub_data in data_iterator:
    chunks.append(sub_data)
    gc.collect()
data = pd.concat(chunks, ignore_index=True)
Piece of code that worked. Thanks for all the help guys; I had messed up my dtypes by specifying Python types instead of numpy ones. Once I fixed that, the code below worked like a charm.
import numpy as np
import pandas as pd

# pd.np is deprecated; use numpy directly. Note also that several
# columns exceed the int8 range (-128..127) in the sample rows above,
# so they need wider integer types.
dtypes = {'Semana': np.int8,
          'Agencia_ID': np.int16,    # sample value 1110 exceeds int8
          'Canal_ID': np.int8,
          'Ruta_SAK': np.int16,      # e.g. 3301
          'Cliente_ID': np.int32,    # e.g. 15766
          'Producto_ID': np.int16,   # e.g. 1242
          'Venta_uni_hoy': np.int8,
          'Venta_hoy': np.float16,
          'Dev_uni_proxima': np.int8,
          'Dev_proxima': np.float16,
          'Demanda_uni_equil': np.int8}
data = pd.read_csv('data/train.csv', dtype=dtypes)
This brought the memory consumption down to just under 4 GB.
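To check where the memory actually goes once dtypes are set, pandas can report per-column usage. A small sketch with made-up rows and two of the dtypes from above (column names taken from the question):

```python
import numpy as np
import pandas as pd

# Tiny frame mirroring two columns from the question, values made up
df = pd.DataFrame({
    'Semana': np.array([3, 3, 3], dtype=np.int8),                    # 1 byte/row
    'Venta_hoy': np.array([25.14, 33.52, 39.32], dtype=np.float16),  # 2 bytes/row
})
print(df.memory_usage(index=False, deep=True))
# Semana: 3 bytes, Venta_hoy: 6 bytes -> 3 bytes per row for these columns
```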
Answer
A file stored as text is not as compact as a compressed binary format, but it is relatively compact data-wise. If it's a simple ASCII file, then aside from any header information, each character is only 1 byte. Python strings have a similar relation: there is some overhead for internal Python bookkeeping, but each extra character adds only 1 byte (from testing with __sizeof__). Once you start converting to numeric types and collections (lists, arrays, data frames, etc.), the overhead grows. A list, for example, must store a type and a value for each position, whereas a string stores only values.
>>> s = '3,1110,7,3301,15766,1212,3,25.14,0,0.0,3\r\n'
>>> l = [3,1110,7,3301,15766,1212,3,25.14,0,0.0,3]
>>> s.__sizeof__()
75
>>> l.__sizeof__()
128
A little bit of testing (assuming __sizeof__ is accurate):
import numpy as np
import pandas as pd
s = '1,2,3,4,5,6,7,8,9,10'
print ('string: '+str(s.__sizeof__())+'\n')
l = [1,2,3,4,5,6,7,8,9,10]
print ('list: '+str(l.__sizeof__())+'\n')
a = np.array([1,2,3,4,5,6,7,8,9,10])
print ('array: '+str(a.__sizeof__())+'\n')
b = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.dtype('u1'))
print ('byte array: '+str(b.__sizeof__())+'\n')
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
print ('dataframe: '+str(df.__sizeof__())+'\n')
Returns:
string: 53
list: 120
array: 136
byte array: 106
dataframe: 152
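The fixed per-object overheads above stop mattering at scale: across 74 million rows it is the per-element width that dominates, which is why narrowing the dtypes cut the memory so sharply. A quick illustration of the scaling (array contents are irrelevant here):

```python
import numpy as np

n = 1_000_000
a64 = np.zeros(n, dtype=np.int64)  # 8 bytes per element
a8 = np.zeros(n, dtype=np.int8)    # 1 byte per element
print(a64.nbytes, a8.nbytes)       # 8000000 1000000
```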