Read a large csv into a sparse pandas dataframe in a memory efficient way


Question

The pandas read_csv function doesn't seem to have a sparse option. I have csv data with a ton of zeros in it (it compresses very well, and stripping out any 0 value reduces it to almost half the original size).

I've tried loading it into a dense matrix first with read_csv and then calling to_sparse, but it takes a long time and chokes on text fields, although most of the data is floating point. If I call pandas.get_dummies(df) first to convert the categorical columns to ones & zeros and then call to_sparse(fill_value=0), it takes an absurd amount of time, much longer than I would expect for a mostly numeric table that has 12 million entries, mostly zero. This happens even if I strip the zeros out of the original file and call to_sparse() (so that the fill value is NaN), and regardless of whether I pass kind='block' or kind='integer'.
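
For concreteness, the dense-then-sparse workflow described above looks roughly like this (a sketch only, reproducing the steps from the question with the pre-1.0 pandas to_sparse API; it assumes the test.csv sample file generated below):

import pandas as pd

df = pd.read_csv('test.csv')        # dense load: this is where memory balloons
df = pd.get_dummies(df)             # categorical columns -> ones & zeros
sdf = df.to_sparse(fill_value=0)    # slow step; kind='block' vs kind='integer' made no difference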

Other than building the sparse dataframe by hand, is there a good, smooth way to load a sparse csv directly without eating up gobs of unnecessary memory?

Here is some code to create a sample dataset that has 3 columns of floating point data and one column of text data. Approximately 85% of the float values are zero and the total size of the CSV is approximately 300 MB but you will probably want to make this larger to really test the memory constraints.

import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randn(10000000, 3), columns=list('xyz'))
df[df < 1.0] = 0.0                                              # zero out ~85% of the floats
df['txt'] = np.random.choice(list('abcdefghij'), size=len(df))  # one categorical text column
df.to_csv('test.csv', index=False)

And here is a simple way to read it, but hopefully there is a better, more efficient way:

sdf = pd.read_csv('test.csv', dtype={'txt': 'category'}).to_sparse(fill_value=0.0)

Edit to Add (from JohnE): If possible, please provide some relative performance stats on reading large CSVs in your answer, including info on how you measured memory efficiency (especially as memory efficiency is harder to measure than clock time). In particular, note that a slower (clock time) answer could be the best answer here, if it is more memory efficient.
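
One straightforward way to capture such a number is memory_profiler's memory_usage function, which samples the process's memory while a callable runs (a sketch; the load_sparse wrapper is hypothetical and just wraps the simple load shown above):

from memory_profiler import memory_usage
import pandas as pd

def load_sparse():
    # the simple dense-then-sparse load from above
    return pd.read_csv('test.csv', dtype={'txt': 'category'}).to_sparse(fill_value=0.0)

peak_mib = max(memory_usage((load_sparse, (), {})))
print('peak memory: %.1f MiB' % peak_mib)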

Answer

I would probably address this by using dask to load your data in a streaming fashion. For example, you can create a dask dataframe as follows:

import dask.dataframe as ddf
data = ddf.read_csv('test.csv')

This data object hasn't actually done anything at this point; it just contains a "recipe" of sorts to read the dataframe from disk in manageable chunks. If you want to materialize the data, you can call compute():

df = data.compute().reset_index(drop=True)

At this point, you have a standard pandas dataframe (we call reset_index because by default each partition is independently indexed). The result is equivalent to what you get by calling pd.read_csv directly:

df.equals(pd.read_csv('test.csv'))
# True

The benefit of dask is you can add instructions to this "recipe" for constructing your dataframe; for example, you could make each partition of the data sparse as follows:

data = data.map_partitions(lambda part: part.to_sparse(fill_value=0))

At this point, calling compute() will construct a sparse dataframe:

df = data.compute().reset_index(drop=True)
type(df)
# pandas.core.sparse.frame.SparseDataFrame
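
As a quick sanity check that the result is actually stored sparsely, the old SparseDataFrame API exposes a density property (the exact value depends on the data; the text column remains fully materialized):

df.density   # fraction of entries explicitly stored; the rest are the fill value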

Profiling

To check how the dask approach compares to the raw pandas approach, let's do some line profiling. I'll use lprun and mprun, as described here (full disclosure: that's a section of my own book).

Assuming you're working in the Jupyter notebook, you can run it this way:

First, create a separate file with the basic tasks we want to do:

%%file dask_load.py

import numpy as np
import pandas as pd
import dask.dataframe as ddf

def compare_loads():
    df = pd.read_csv('test.csv')
    df_sparse = df.to_sparse(fill_value=0)

    df_dask = ddf.read_csv('test.csv', blocksize=10E6)
    df_dask = df_dask.map_partitions(lambda part: part.to_sparse(fill_value=0))
    df_dask = df_dask.compute().reset_index(drop=True)

Next let's do line-by-line profiling for computation time:

%load_ext line_profiler

from dask_load import compare_loads
%lprun -f compare_loads compare_loads()

I get the following result:

Timer unit: 1e-06 s

Total time: 13.9061 s
File: /Users/jakevdp/dask_load.py
Function: compare_loads at line 6

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           def compare_loads():
     7         1      4746788 4746788.0     34.1      df = pd.read_csv('test.csv')
     8         1       769303 769303.0      5.5      df_sparse = df.to_sparse(fill_value=0)
     9                                           
    10         1        33992  33992.0      0.2      df_dask = ddf.read_csv('test.csv', blocksize=10E6)
    11         1         7848   7848.0      0.1      df_dask = df_dask.map_partitions(lambda part: part.to_sparse(fill_value=0))
    12         1      8348217 8348217.0     60.0      df_dask = df_dask.compute().reset_index(drop=True)

We see that about 60% of the time is spent in the dask call, while about 40% of the time is spent in the pandas call for the example array above. This tells us that dask is about 50% slower than pandas for this task: this is to be expected, because the chunking and recombining of data partitions leads to some extra overhead.

Where dask shines is in memory usage: let's use mprun to do a line-by-line memory profile:

%load_ext memory_profiler
%mprun -f compare_loads compare_loads()

The result on my machine is this:

Filename: /Users/jakevdp/dask_load.py

Line #    Mem usage    Increment   Line Contents
================================================
     6     70.9 MiB     70.9 MiB   def compare_loads():
     7    691.5 MiB    620.6 MiB       df = pd.read_csv('test.csv')
     8    828.8 MiB    137.3 MiB       df_sparse = df.to_sparse(fill_value=0)
     9                             
    10    806.3 MiB    -22.5 MiB       df_dask = ddf.read_csv('test.csv', blocksize=10E6)
    11    806.4 MiB      0.1 MiB       df_dask = df_dask.map_partitions(lambda part: part.to_sparse(fill_value=0))
    12    947.9 MiB    141.5 MiB       df_dask = df_dask.compute().reset_index(drop=True)

We see that the final pandas dataframe size is about ~140MB, but pandas uses ~620MB along the way as it reads the data into a temporary dense object.

On the other hand, dask only uses ~140MB total in loading the array and constructing the final sparse result. In the case that you are reading data whose dense size is comparable to the memory available on your system, dask has a clear advantage, despite the ~50% slower computational time.

But for working with large data, you should not stop here. Presumably you're doing some operations on your data, and the dask dataframe abstraction allows you to do those operations (i.e. add them to the "recipe") before ever materializing the data. So if what you're doing with the data involves arithmetic, aggregations, grouping, etc. you don't even need to worry about the sparse storage: just do those operations with the dask object, call compute() at the end, and dask will take care of applying them in a memory efficient way.

So, for example, I could compute the max() of each column using the dask dataframe, without ever having to load the whole thing into memory at once:

>>> data.max().compute()
x      5.38114
y      5.33796
z      5.25661
txt          j
dtype: object
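
The same pattern covers other deferred operations; for instance, a grouped aggregation over the sample data's txt and x columns also stays lazy until compute() is called (a sketch, assuming groupby behaves the same on the sparse partitions):

data.groupby('txt').x.mean().compute()   # per-group mean, evaluated partition by partition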

Working with dask dataframes directly will allow you to circumvent worries about data representation, because you'll likely never have to load all the data into memory at once.

Good luck!
