Pandas/Python memory spike while reading 3.2 GB file


Problem description

So I have been trying to read a 3.2 GB file into memory using pandas' read_csv function, but I kept running into some sort of memory leak; my memory usage would spike to 90%+.

So, as alternatives:

1. I tried defining dtype to avoid keeping the data in memory as strings, but saw similar behaviour.

2. Tried out numpy's CSV reader, thinking I would get some different results, but was definitely wrong about that.

3. Tried reading line by line and ran into the same problem, just really slowly (see the sketch after this list).

4. I recently moved to Python 3, so thought there could be some bug there, but saw similar results on Python 2 + pandas.
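For reference, the post does not show how the line-by-line attempt (item 3) was coded, so the following is only a minimal sketch of that approach using the standard csv module:

import csv

rows = []
with open('data/train.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # skip the column names
    for row in reader:
        # every field is parsed in pure Python, one value at a time,
        # which is why this ends up so slow for ~74 million rows
        rows.append([float(x) for x in row])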

The file in question is the train.csv file from the Grupo Bimbo Kaggle competition.

System info:

RAM: 16 GB, Processor: i7, 8 cores

Let me know if you would like to know anything else.

Thanks :)

EDIT 1: It's a memory spike, not a leak (sorry, my bad).

Sample of the csv file:

Semana,Agencia_ID,Canal_ID,Ruta_SAK,Cliente_ID,Producto_ID,Venta_uni_hoy,Venta_hoy,Dev_uni_proxima,Dev_proxima,Demanda_uni_equil
3,1110,7,3301,15766,1212,3,25.14,0,0.0,3
3,1110,7,3301,15766,1216,4,33.52,0,0.0,4
3,1110,7,3301,15766,1238,4,39.32,0,0.0,4
3,1110,7,3301,15766,1240,4,33.52,0,0.0,4
3,1110,7,3301,15766,1242,3,22.92,0,0.0,3

Number of rows in the file: 74180465

Apart from the plain pd.read_csv('filename', low_memory=False), I have also tried:

from numpy import genfromtxt
my_data = genfromtxt('data/train.csv', delimiter=',')

UPDATE: The code below just worked, but I still want to get to the bottom of this problem; there must be something wrong.

import pandas as pd
import gc

data = pd.DataFrame()
data_iterator = pd.read_csv('data/train.csv', chunksize=100000)
for sub_data in data_iterator:
    # DataFrame.append returns a new frame, so the result must be reassigned
    data = data.append(sub_data)
    gc.collect()
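A common variant of the same idea is to collect the chunks and concatenate them once at the end, which avoids rebuilding the DataFrame on every iteration; a minimal sketch, assuming the same file and chunk size:

import pandas as pd

# read the file in chunks and build the final frame with a single concat
chunks = pd.read_csv('data/train.csv', chunksize=100000)
data = pd.concat(chunks, ignore_index=True)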

Piece of code that worked. Thanks for all the help guys; I had messed up my dtypes by specifying Python types instead of numpy ones. Once I fixed that, the code below worked like a charm.

import numpy as np
import pandas as pd

# Explicit numpy dtypes keep read_csv from defaulting everything to 64-bit
# numbers or Python objects. The ID columns use int32 so that values like
# Cliente_ID = 15766 in the sample rows above fit (int8 only holds -128..127).
dtypes = {'Semana': np.int8,
          'Agencia_ID': np.int32,
          'Canal_ID': np.int8,
          'Ruta_SAK': np.int32,
          'Cliente_ID': np.int32,
          'Producto_ID': np.int32,
          'Venta_uni_hoy': np.int8,
          'Venta_hoy': np.float16,
          'Dev_uni_proxima': np.int8,
          'Dev_proxima': np.float16,
          'Demanda_uni_equil': np.int8}
data = pd.read_csv('data/train.csv', dtype=dtypes)

This brought down the memory consumption to just under 4 GB.
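To see where the memory actually goes, pandas can report per-column usage directly; a quick check (deep=True also counts the overhead of Python object columns such as strings):

# per-column memory usage in bytes, including object (string) overhead
print(data.memory_usage(deep=True))

# or a one-line summary of the whole frame
data.info(memory_usage='deep')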

Answer

A file stored in memory as text is not as compact as a compressed binary format, but it is relatively compact data-wise. If it's a simple ASCII file then, aside from any file header information, each character is only 1 byte. Python strings have a similar relation: there's some overhead for internal Python bookkeeping, but each extra character adds only 1 byte (from testing with __sizeof__). Once you start converting to numeric types and collections (lists, arrays, data frames, etc.), the overhead grows. A list, for example, must store a type and a value for each position, whereas a string only stores the values.

>>> s = '3,1110,7,3301,15766,1212,3,25.14,0,0.0,3\r\n'
>>> l = [3,1110,7,3301,15766,1212,3,25.14,0,0.0,3]
>>> s.__sizeof__()
75
>>> l.__sizeof__()
128

A little bit of testing (assuming __sizeof__ is accurate):

import numpy as np
import pandas as pd

s = '1,2,3,4,5,6,7,8,9,10'
print ('string: '+str(s.__sizeof__())+'\n')
l = [1,2,3,4,5,6,7,8,9,10]
print ('list: '+str(l.__sizeof__())+'\n')
a = np.array([1,2,3,4,5,6,7,8,9,10])
print ('array: '+str(a.__sizeof__())+'\n')
b = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.dtype('u1'))
print ('byte array: '+str(b.__sizeof__())+'\n')
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])
print ('dataframe: '+str(df.__sizeof__())+'\n')

Returns:

string: 53

list: 120

array: 136

byte array: 106

dataframe: 152
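The same overhead difference shows up per column in a DataFrame: the default int64 costs 8 bytes per value, while an explicit int8 column costs 1 byte per value. A small sketch extending the test above (the column name and sizes are arbitrary):

import numpy as np
import pandas as pd

values = np.arange(100)  # 0..99 fits comfortably in int8

df64 = pd.DataFrame({'x': values})                  # int64 by default
df8 = pd.DataFrame({'x': values.astype(np.int8)})   # explicit 1-byte integers

print(df64['x'].memory_usage(index=False))  # 800 bytes of column data
print(df8['x'].memory_usage(index=False))   # 100 bytes of column data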
