Pandas read_csv on 6.5 GB file consumes more than 170GB RAM


Question

I wanted to bring this up, just because it's crazy weird. Maybe Wes has some idea. The file is pretty regular: 1100 rows x ~3M columns, data are tab-separated, consisting solely of the integers 0, 1, and 2. Clearly this is not expected.

If I prepopulate a dataframe as below, it consumes ~26GB of RAM.

import pandas as pd

# only the header line of the file is needed here
with open("ms.txt") as h:
    header = h.readline().split("\t")

rows = 1100
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)

System info:


  • python 2.7.9
  • ipython 2.3.1
  • numpy 1.9.1
  • pandas 0.15.2

Any ideas welcome.

Recommended answer

The problem with your example

Trying your code on a small scale, I notice that even if you set dtype=int, you actually end up with dtype=object in the resulting dataframe.

import pandas as pd

header = ['a', 'b', 'c']
rows = 11
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)

df.dtypes
a    object
b    object
c    object
dtype: object

This is because even though you tell the pd.DataFrame constructor that the columns are dtype=int, it cannot override the dtypes ultimately determined by the data in the columns.

This is because pandas is tightly coupled to numpy and numpy dtypes.

The problem is that there is no data in your created dataframe, so numpy defaults the values to np.NaN, which does not fit in an integer.
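A quick standalone check (not part of the original answer) makes this concrete: numpy refuses to place NaN into an integer array, while a float array holds it without trouble.

```python
import numpy as np

# NaN only exists for floating-point types; asking numpy to store it
# in an integer array raises a ValueError
try:
    np.array([np.nan], dtype=np.int64)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# a float array holds NaN with no trouble
a = np.array([np.nan], dtype=np.float64)
print(np.isnan(a[0]))  # True
```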

This means numpy gets confused and falls back to dtype object.

Having the dtype set to object means a big overhead in memory consumption and allocation time compared to having the dtype set to integer or float.
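As a small illustration of that overhead (a scaled-down sketch, not a measurement from the original ~3M-column file), memory_usage(deep=True) shows how much more an object column costs than an integer one:

```python
import numpy as np
import pandas as pd

n = 3000  # small stand-in for the real column count
values = [0, 1, 2] * (n // 3)

s_obj = pd.Series(values, dtype=object)    # what the empty constructor ends up with
s_int = pd.Series(values, dtype=np.int64)  # what was intended

obj_bytes = s_obj.memory_usage(deep=True)
int_bytes = s_int.memory_usage(deep=True)

# object storage keeps a pointer per cell plus a boxed Python int,
# so it costs several times more than a plain int64 buffer
print(obj_bytes > int_bytes)  # True
```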

df = pd.DataFrame(columns=header, index=range(rows), dtype=float)

This works just fine, since np.NaN can live in a float. This produces

a    float64
b    float64
c    float64
dtype: object

And it should take less memory.
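For the original file, one approach worth trying (assuming, as stated, that the values are only 0, 1 and 2 with no missing cells) is to hand read_csv an explicit small integer dtype; StringIO stands in for ms.txt here:

```python
from io import StringIO

import numpy as np
import pandas as pd

# inline stand-in for ms.txt (the real file is 1100 rows x ~3M columns)
data = "a\tb\tc\n0\t1\t2\n2\t2\t1\n1\t0\t0\n"

# since the values are only 0, 1 and 2, int8 is enough:
# 1 byte per cell instead of 8 for float64
df = pd.read_csv(StringIO(data), sep="\t", dtype=np.int8)
print(df.dtypes.tolist())  # [dtype('int8'), dtype('int8'), dtype('int8')]
```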

See this related post for details on dtype: Pandas read_csv low_memory and dtype options
