Pandas read_csv on 6.5 GB file consumes more than 170GB RAM
Question
I wanted to bring this up, just because it's crazy weird. Maybe Wes has some idea. The file is pretty regular: 1100 rows x ~3M columns, data are tab-separated, consisting solely of the integers 0, 1, and 2. Clearly this is not expected.
If I prepopulate a dataframe as below, it consumes ~26GB of RAM.
import pandas as pd

with open("ms.txt") as h:
    header = h.readline().split("\t")

rows = 1100
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
System info:
- python 2.7.9
- ipython 2.3.1
- numpy 1.9.1
- pandas 0.15.2
Any ideas welcome.
Answer
The problem with your example

Trying your code at a small scale, I noticed that even if you set dtype=int, you actually end up with dtype=object in the resulting dataframe.
header = ['a','b','c']
rows = 11
df = pd.DataFrame(columns=header, index=range(rows), dtype=int)
df.dtypes
a object
b object
c object
dtype: object
This is because even though you give the pd.read_csv function the instruction that the columns are dtype=int, it cannot override the dtypes ultimately determined by the data in the column.
This is because pandas is tightly coupled to numpy and numpy dtypes.
The problem is, there is no data in your created dataframe, so numpy defaults the data to np.NaN, which does not fit in an integer.
This means numpy gets confused and falls back to the dtype being object.
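You can see this behavior directly in numpy itself (a minimal sketch; note that the default integer width is platform-dependent, while NaN always forces a float):

```python
import numpy as np

# An integer array has no representation for NaN...
a = np.array([0, 1, 2], dtype=int)
print(a.dtype)  # a native integer dtype, e.g. int64

# ...so mixing NaN into the data forces a floating-point dtype instead:
b = np.array([0, 1, np.nan])
print(b.dtype)  # float64
```

In the DataFrame case there is not even any data to promote, only NaN fill values, which is why pandas lands on the even more general object dtype.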
Having the dtype set to object means a big overhead in memory consumption and allocation time compared to having the dtype set to integer or float.
df = pd.DataFrame(columns=header, index=range(rows), dtype=float)
This works just fine, since np.NaN can live in a float. This produces
a float64
b float64
c float64
dtype: object
And should take less memory.
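A rough way to measure the difference (a sketch assuming a pandas version with DataFrame.memory_usage(deep=True), which is newer than the 0.15.2 used in the question; the column names and row count are illustrative):

```python
import numpy as np
import pandas as pd

rows, header = 1100, list("abc")

# what the question's construction effectively produces: object-dtype NaNs
df_obj = pd.DataFrame(np.nan, index=range(rows), columns=header, dtype=object)
# the fix from the answer: NaN fits natively in float64
df_flt = pd.DataFrame(np.nan, index=range(rows), columns=header, dtype=float)

# deep=True also counts the boxed Python objects behind an object column
obj_bytes = df_obj.memory_usage(deep=True).sum()
flt_bytes = df_flt.memory_usage(deep=True).sum()
print(obj_bytes > flt_bytes)  # True: object columns cost far more per cell
```

Each object cell stores an 8-byte pointer plus a boxed Python float, versus a flat 8 bytes per cell for float64, which is where the multi-fold blowup comes from.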
See this related post for details on dtype: Pandas read_csv low_memory and dtype options
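For the 0/1/2 matrix itself, the practical route is to hand read_csv an explicit narrow integer dtype so nothing is ever boxed as object (a sketch using an in-memory stand-in for ms.txt; the column names and sample data are illustrative):

```python
import io

import numpy as np
import pandas as pd

# small stand-in for the tab-separated 0/1/2 file from the question
sample = "s1\ts2\ts3\n0\t1\t2\n2\t1\t0\n"

# one byte per cell instead of an 8-byte pointer to a boxed object
df = pd.read_csv(io.StringIO(sample), sep="\t", dtype=np.int8)
print(df.dtypes.unique())  # [dtype('int8')]
```

Since the data here actually are integers, the requested dtype can be honored, and int8 stores the 1100 x ~3M matrix in roughly 3.3 GB.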