Pandas DataFrame concat vs append

Problem description

I have a list of 4 pandas dataframes containing a day of tick data that I want to merge into a single data frame. I cannot understand the behavior of concat on my timestamps. See details below:

data

[<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 35228 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-03-28 18:59:20.357000+02:00
Data columns:
Price       4040  non-null values
Volume      4040  non-null values
BidQty      35228  non-null values
BidPrice    35228  non-null values
AskPrice    35228  non-null values
AskQty      35228  non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 33088 entries, 2013-04-01 00:03:17.047000+02:00 to 2013-04-01 18:59:58.175000+02:00
Data columns:
Price       3969  non-null values
Volume      3969  non-null values
BidQty      33088  non-null values
BidPrice    33088  non-null values
AskPrice    33088  non-null values
AskQty      33088  non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 50740 entries, 2013-04-02 00:03:27.470000+02:00 to 2013-04-02 18:59:58.172000+02:00
Data columns:
Price       7326  non-null values
Volume      7326  non-null values
BidQty      50740  non-null values
BidPrice    50740  non-null values
AskPrice    50740  non-null values
AskQty      50740  non-null values
dtypes: float64(6),
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 60799 entries, 2013-04-03 00:03:06.994000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
Price       8258  non-null values
Volume      8258  non-null values
BidQty      60799  non-null values
BidPrice    60799  non-null values
AskPrice    60799  non-null values
AskQty      60799  non-null values
dtypes: float64(6)]

Using append I get:

pd.DataFrame().append(data)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-28 00:00:07.089000+02:00 to 2013-04-03 18:59:58.180000+02:00
Data columns:
AskPrice    179855  non-null values
AskQty      179855  non-null values
BidPrice    179855  non-null values
BidQty      179855  non-null values
Price       23593  non-null values
Volume      23593  non-null values
dtypes: float64(6)

Using concat I get:

pd.concat(data)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 179855 entries, 2013-03-27 22:00:07.089000+02:00 to 2013-04-03 16:59:58.180000+02:00
Data columns:
Price       23593  non-null values
Volume      23593  non-null values
BidQty      179855  non-null values
BidPrice    179855  non-null values
AskPrice    179855  non-null values
AskQty      179855  non-null values
dtypes: float64(6)

Notice how the index changes when using concat. Why is that happening and how would I go about using concat to reproduce the results obtained using append? (Since concat seems so much faster; 24.6 ms per loop vs 3.02 s per loop)

Recommended answer

What you are doing with append and concat is almost equivalent. The difference is the empty DataFrame. For some reason this causes a big slowdown; I am not sure exactly why and will have to look into it at some point. Below is a recreation of basically what you did.

I almost always use concat (though in this case they are equivalent, except for the empty frame); if you don't use the empty frame they will be the same speed.
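As a rough way to check that concat on its own leaves the timestamps alone, here is a minimal sketch on toy data. The helper `make_day`, the column values, and the fixed +02:00 offset are assumptions made for illustration; they are not the question's actual data. In current pandas releases `DataFrame.append` has been removed (as of pandas 2.0), so `pd.concat` on the list of frames is the route to take there.

```python
from datetime import timedelta, timezone

import pandas as pd

# Assumption: a fixed +02:00 offset stands in for the question's timezone.
TZ = timezone(timedelta(hours=2))


def make_day(start, periods=5):
    """Build a small toy frame with a timezone-aware DatetimeIndex."""
    index = pd.date_range(start, periods=periods, freq="s", tz=TZ)
    return pd.DataFrame({"BidPrice": range(periods)}, index=index, dtype="float64")


days = [make_day("2013-03-28"), make_day("2013-04-01"), make_day("2013-04-02")]

# Concatenate only the real frames; no empty DataFrame in front.
combined = pd.concat(days)

# Compare the endpoints with the inputs, and inspect the UTC offset separately,
# because Timestamp equality compares instants rather than displayed offsets.
first_matches = combined.index[0] == days[0].index[0]
last_matches = combined.index[-1] == days[-1].index[-1]
offset = combined.index[0].utcoffset()
print(first_matches, last_matches, offset)
```

If the offset printed for your own data differs from the inputs, `combined.index.tz_convert(...)` is one way to bring the index back to the intended timezone.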

In [17]: df1 = pd.DataFrame(dict(A = range(10000)),index=pd.date_range('20130101',periods=10000,freq='s'))

In [18]: df1
Out[18]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10000 entries, 2013-01-01 00:00:00 to 2013-01-01 02:46:39
Freq: S
Data columns (total 1 columns):
A    10000  non-null values
dtypes: int64(1)

In [19]: df4 = pd.DataFrame()

The concat

In [20]: %timeit pd.concat([df1,df2,df3])
1000 loops, best of 3: 270 us per loop

This is the equivalent of your append

In [21]: %timeit pd.concat([df4,df1,df2,df3])
10 loops, best of 3: 56.8 ms per loop
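The session above does not show how df2 and df3 were created. A self-contained sketch of the same comparison, assuming df2 and df3 are simply copies of df1 (an assumption, not something the answer states), could look like the following. Timings depend heavily on the pandas version and the machine, so no expected numbers are given, and recent pandas versions may emit a warning when an empty frame is passed to concat.

```python
import timeit

import pandas as pd

# Assumption: three identical toy frames stand in for df1, df2, df3.
index = pd.date_range("20130101", periods=10000, freq="s")
df1 = pd.DataFrame({"A": range(10000)}, index=index)
df2 = df1.copy()
df3 = df1.copy()
df4 = pd.DataFrame()  # the empty frame from the original append call

frames = [df1, df2, df3]
runs = 20

# Time concat without and with the leading empty frame.
t_plain = timeit.timeit(lambda: pd.concat(frames), number=runs)
t_empty = timeit.timeit(lambda: pd.concat([df4] + frames), number=runs)

print(f"without empty frame: {t_plain / runs * 1e3:.3f} ms per call")
print(f"with empty frame:    {t_empty / runs * 1e3:.3f} ms per call")

# Both variants should contain the same number of rows.
rows_plain = len(pd.concat(frames))
rows_empty = len(pd.concat([df4] + frames))
```

Dropping the empty frame from the list is the only change needed to move from the slow variant to the fast one.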
