创建大型Pandas DataFrame:预分配vs追加vs concat [英] Creating large Pandas DataFrames: preallocation vs append vs concat

查看:131
本文介绍了创建大型Pandas DataFrame:预分配vs追加vs concat的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在逐块构建大型数据帧时,我对Pandas中的性能感到困惑.在Numpy中,通过预分配一个大的空数组然后填充值,我们(几乎)总是会看到更好的性能.据我了解,这是由于Numpy一次获取了它需要的所有内存,而不必每次执行append操作都重新分配内存.

I am confused by the performance in Pandas when building a large dataframe chunk by chunk. In Numpy, we (almost) always see better performance by preallocating a large empty array and then filling in the values. As I understand it, this is due to Numpy grabbing all the memory it needs at once instead of having to reallocate memory with every append operation.

在Pandas中,通过使用df = df.append(temp)模式,我似乎获得了更好的性能.

In Pandas, I seem to be getting better performance by using the df = df.append(temp) pattern.

这是一个有关计时的例子. Timer类的定义如下.如您所见,我发现预分配的速度比使用append慢大约10倍!预先为数据帧分配具有适当dtype的np.empty值很有帮助,但是append方法仍然是最快的.

Here is an example with timing. The definition of the Timer class follows. As you, see I find that preallocating is roughly 10x slower than using append! Preallocating a dataframe with np.empty values of the appropriate dtype helps a great deal, but the append method is still the fastest.

import numpy as np
from numpy.random import rand
import pandas as pd

from timer import Timer

# Some constants
num_dfs = 10  # Number of random dataframes to generate
n_rows = 2500
n_cols = 40
n_reps = 100  # Number of repetitions for timing

# Generate a list of num_dfs dataframes of random values
df_list = [pd.DataFrame(rand(n_rows*n_cols).reshape((n_rows, n_cols)), columns=np.arange(n_cols)) for i in np.arange(num_dfs)]

##
# Define two methods of growing a large dataframe
##

# Method 1 - append dataframes
def method1():
    out_df1 = pd.DataFrame(columns=np.arange(4))
    for df in df_list:
        out_df1 = out_df1.append(df, ignore_index=True)
    return out_df1

def method2():
# # Create an empty dataframe that is big enough to hold all the dataframes in df_list
out_df2 = pd.DataFrame(columns=np.arange(n_cols), index=np.arange(num_dfs*n_rows))
#EDIT_1: Set the dtypes of each column
for ix, col in enumerate(out_df2.columns):
    out_df2[col] = out_df2[col].astype(df_list[0].dtypes[ix])
# Fill in the values
for ix, df in enumerate(df_list):
    out_df2.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
return out_df2

# EDIT_2: 
# Method 3 - preallocate dataframe with np.empty data of appropriate type
def method3():
    # Create fake data array
    data = np.transpose(np.array([np.empty(n_rows*num_dfs, dtype=dt) for dt in df_list[0].dtypes]))
    # Create placeholder dataframe
    out_df3 = pd.DataFrame(data)
    # Fill in the real values
    for ix, df in enumerate(df_list):
        out_df3.iloc[ix*n_rows:(ix+1)*n_rows, :] = df.values
    return out_df3

##
# Time both methods
##

# Time Method 1
times_1 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
       df1 = method1()
    times_1[i] = t.secs
print 'Total time for %d repetitions of Method 1: %f [sec]' % (n_reps, np.sum(times_1))
print 'Best time: %f' % (np.min(times_1))
print 'Mean time: %f' % (np.mean(times_1))

#>>  Total time for 100 repetitions of Method 1: 2.928296 [sec]
#>>  Best time: 0.028532
#>>  Mean time: 0.029283

# Time Method 2
times_2 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df2 = method2()
    times_2[i] = t.secs
print 'Total time for %d repetitions of Method 2: %f [sec]' % (n_reps, np.sum(times_2))
print 'Best time: %f' % (np.min(times_2))
print 'Mean time: %f' % (np.mean(times_2))

#>>  Total time for 100 repetitions of Method 2: 32.143247 [sec]
#>>  Best time: 0.315075
#>>  Mean time: 0.321432

# Time Method 3
times_3 = np.empty(n_reps)
for i in np.arange(n_reps):
    with Timer() as t:
        df3 = method3()
    times_3[i] = t.secs
print 'Total time for %d repetitions of Method 3: %f [sec]' % (n_reps, np.sum(times_3))
print 'Best time: %f' % (np.min(times_3))
print 'Mean time: %f' % (np.mean(times_3))

#>>  Total time for 100 repetitions of Method 3: 6.577038 [sec]
#>>  Best time: 0.063437
#>>  Mean time: 0.065770

我使用了很好的Timer,由Huy Nguyen提供:

I use a nice Timer courtesy of Huy Nguyen:

# credit: http://www.huyng.com/posts/python-performance-analysis/

import time

class Timer(object):
    def __init__(self, verbose=False):
        self.verbose = verbose

    def __enter__(self):
        self.start = time.clock()
        return self

    def __exit__(self, *args):
        self.end = time.clock()
        self.secs = self.end - self.start
        self.msecs = self.secs * 1000  # millisecs
        if self.verbose:
            print 'elapsed time: %f ms' % self.msecs

如果您仍在关注,我有两个问题:

If you are still following, I have two questions:

1)为什么append方法更快? (注意:对于非常小的数据帧,即n_rows = 40,它实际上要慢一些.)

1) Why is the append method faster? (NOTE: for very small dataframes, i.e. n_rows = 40, it is actually slower).

2)用块构建大型数据框的最有效方法是什么? (就我而言,这些块都是大的csv文件).

2) What is the most efficient way to build a large dataframe out of chunks? (In my case, the chunks are all large csv files).

感谢您的帮助!

EDIT_1: 在我的实际项目中,列具有不同的dtype.因此,按照BrenBarn的建议,我无法使用pd.DataFrame(.... dtype=some_type)技巧来提高预分配的性能. dtype参数强制所有列都使用相同的dtype [Ref.问题 4464]

EDIT_1: In my real world project, the columns have different dtypes. So I cannot use the pd.DataFrame(.... dtype=some_type) trick to improve the performance of preallocation, per BrenBarn's recommendation. The dtype parameter forces all the columns to be the same dtype [Ref. issue 4464]

我在我的代码中向method2()添加了一些行,以逐列更改dtypes以匹配输入数据帧.此操作很昂贵,并且在编写行块时会否定使用适当的dtypes的好处.

I added some lines to method2() in my code to change the dtypes column-by-column to match in the input dataframes. This operation is expensive and negates the benefits of having the appropriate dtypes when writing blocks of rows.

EDIT_2:尝试使用占位符数组np.empty(... dtyp=some_type)预分配数据框.根据@Joris的建议.

EDIT_2: Try preallocating a dataframe using placeholder array np.empty(... dtyp=some_type). Per @Joris's suggestion.

推荐答案

您的基准测试实际上太小,无法显示出真正的区别. 追加,每次复制,因此您实际上是在复制N大小的存储空间N *(N-1)次.随着数据框大小的增加,这效率极低.在很小的框架内,这当然可能无关紧要.但是,如果您有任何实际尺寸,这很重要.在文档此处中对此进行了特别说明. ,虽然有点警告.

Your benchmark is actually too small to show the real difference. Appending, copies EACH time, so you are actually doing copying a size N memory space N*(N-1) times. This is horribly inefficient as the size of your dataframe grows. This certainly might not matter in a very small frame. But if you have any real size this matters a lot. This is specifically noted in the docs here, though kind of a small warning.

In [97]: df = DataFrame(np.random.randn(100000,20))

In [98]: df['B'] = 'foo'

In [99]: df['C'] = pd.Timestamp('20130101')

In [103]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 22 columns):
0     100000 non-null float64
1     100000 non-null float64
2     100000 non-null float64
3     100000 non-null float64
4     100000 non-null float64
5     100000 non-null float64
6     100000 non-null float64
7     100000 non-null float64
8     100000 non-null float64
9     100000 non-null float64
10    100000 non-null float64
11    100000 non-null float64
12    100000 non-null float64
13    100000 non-null float64
14    100000 non-null float64
15    100000 non-null float64
16    100000 non-null float64
17    100000 non-null float64
18    100000 non-null float64
19    100000 non-null float64
B     100000 non-null object
C     100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(20), object(1)
memory usage: 17.5+ MB

追加

In [85]: def f1():
   ....:     result = df
   ....:     for i in range(9):
   ....:         result = result.append(df)
   ....:     return result
   ....: 

Concat

In [86]: def f2():
   ....:     result = []
   ....:     for i in range(10):
   ....:         result.append(df)
   ....:     return pd.concat(result)
   ....: 

In [100]: f1().equals(f2())
Out[100]: True

In [101]: %timeit f1()
1 loops, best of 3: 1.66 s per loop

In [102]: %timeit f2()
1 loops, best of 3: 220 ms per loop

请注意,我什至都不会尝试进行预分配.它有些复杂,尤其是因为您要处理多个dtypes(例如,您可以制作一个巨大的框架,只需.loc即可使用).但是pd.concat只是简单,可靠且快速地工作.

Note that I wouldn't even bother trying to pre-allocate. Its somewhat complicated, especially since you are dealing with multiple dtypes (e.g. you could make a giant frame and simply .loc and it would work). But pd.concat is just dead simple, works reliably, and fast.

以及从上方开始选择尺寸的时间

And timing of your sizes from above

In [104]: df = DataFrame(np.random.randn(2500,40))

In [105]: %timeit f1()
10 loops, best of 3: 33.1 ms per loop

In [106]: %timeit f2()
100 loops, best of 3: 4.23 ms per loop

这篇关于创建大型Pandas DataFrame:预分配vs追加vs concat的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆