如何在不复制数据的情况下串联 pandas DataFrame? [英] How do I concatenate pandas DataFrames without copying the data?
问题描述
我要连接两个熊猫DataFrame,而不复制数据.也就是说,我希望串联的DataFrame是两个原始DataFrame中数据的视图.我尝试使用concat(),但是没有用.此代码块显示更改基础数据会影响串联的两个DataFrame,但不会影响串联的DataFrame:
arr = np.random.randn(12).reshape(6, 2)
df = pd.DataFrame(arr, columns = ('VALE5', 'PETR4'), index = dates)
arr2 = np.random.randn(12).reshape(6, 2)
df2 = pd.DataFrame(arr, columns = ('AMBV3', 'BBDC4'), index = dates)
df_concat = pd.concat(dict(A = df, B = df2),axis=1)
pp(df)
pp(df_concat)
arr[0, 0] = 9999999.99
pp(df)
pp(df_concat)
这是最后五行的输出.在将新值分配给arr [0,0]之后df改变了; df_concat不受影响.
In [56]: pp(df)
VALE5 PETR4
2013-01-01 -0.557180 0.170073
2013-01-02 -0.975797 0.763136
2013-01-03 -0.913254 1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005 1.448442
2013-01-06 -0.323640 0.024857
In [57]: pp(df_concat)
A B
VALE5 PETR4 AMBV3 BBDC4
2013-01-01 -0.557180 0.170073 -0.557180 0.170073
2013-01-02 -0.975797 0.763136 -0.975797 0.763136
2013-01-03 -0.913254 1.042521 -0.913254 1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005 1.448442 -1.259005 1.448442
2013-01-06 -0.323640 0.024857 -0.323640 0.024857
In [58]: arr[0, 0] = 9999999.99
In [59]: pp(df)
VALE5 PETR4
2013-01-01 9999999.990000 0.170073
2013-01-02 -0.975797 0.763136
2013-01-03 -0.913254 1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005 1.448442
2013-01-06 -0.323640 0.024857
In [60]: pp(df_concat)
A B
VALE5 PETR4 AMBV3 BBDC4
2013-01-01 -0.557180 0.170073 -0.557180 0.170073
2013-01-02 -0.975797 0.763136 -0.975797 0.763136
2013-01-03 -0.913254 1.042521 -0.913254 1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005 1.448442 -1.259005 1.448442
2013-01-06 -0.323640 0.024857 -0.323640 0.024857
我想这意味着concat()创建了数据的副本.有办法避免制作副本吗? (我想最大程度地减少内存使用量.)
还有,有没有一种快速的方法来检查两个DataFrame是否链接到相同的基础数据? (省去了更改数据和检查每个DataFrame是否已更改的麻烦)
感谢您的帮助.
FS
您不能(至少很容易).当您调用concat
时,最终会调用np.concatenate
.
请参见此答案,说明为什么不能在不复制的情况下连接数组.简短的是,不能保证数组在内存中是连续的.
这是一个简单的示例
a = rand(2, 10)
x, y = a
z = vstack((x, y))
print 'x.base is a and y.base is a ==', x.base is a and y.base is a
print 'x.base is z or y.base is z ==', x.base is z or y.base is z
输出:
x.base is a and y.base is a == True
x.base is z or y.base is z == False
即使x
和y
共享相同的base
,即a
,concatenate
(因此是vstack
)也不能假设它们这样做,因为人们经常想连接任意跨度的数组. /p>
您可以轻松地生成两个步长不同的两个数组,它们共享相同的内存,如下所示:
a = arange(10)
b = a[::2]
print a.strides
print b.strides
输出:
(8,)
(16,)
这就是为什么发生以下情况的原因:
In [214]: a = arange(10)
In [215]: a[::2].view(int16)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-215-0366fadb1128> in <module>()
----> 1 a[::2].view(int16)
ValueError: new type not compatible with array.
In [216]: a[::2].copy().view(int16)
Out[216]: array([0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 6, 0, 0, 0, 8, 0, 0, 0], dtype=int16)
编辑:df1.dtype != df2.dtype
时使用pd.merge(df1, df2, copy=False)
(或df1.merge(df2, copy=False)
)将不会进行复制.否则,将进行复制.
I want to concatenate two pandas DataFrames without copying the data. That is, I want the concatenated DataFrame to be a view on the data in the two original DataFrames. I tried using concat() and that did not work. This block of code shows that changing the underlying data affects the two DataFrames that are concatenated but not the concatenated DataFrame:
arr = np.random.randn(12).reshape(6, 2)
df = pd.DataFrame(arr, columns = ('VALE5', 'PETR4'), index = dates)
arr2 = np.random.randn(12).reshape(6, 2)
df2 = pd.DataFrame(arr, columns = ('AMBV3', 'BBDC4'), index = dates)
df_concat = pd.concat(dict(A = df, B = df2),axis=1)
pp(df)
pp(df_concat)
arr[0, 0] = 9999999.99
pp(df)
pp(df_concat)
This is the output of the last five lines. df changed after a new value was assigned to arr[0, 0]; df_concat was not affected.
In [56]: pp(df)
VALE5 PETR4
2013-01-01 -0.557180 0.170073
2013-01-02 -0.975797 0.763136
2013-01-03 -0.913254 1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005 1.448442
2013-01-06 -0.323640 0.024857
In [57]: pp(df_concat)
A B
VALE5 PETR4 AMBV3 BBDC4
2013-01-01 -0.557180 0.170073 -0.557180 0.170073
2013-01-02 -0.975797 0.763136 -0.975797 0.763136
2013-01-03 -0.913254 1.042521 -0.913254 1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005 1.448442 -1.259005 1.448442
2013-01-06 -0.323640 0.024857 -0.323640 0.024857
In [58]: arr[0, 0] = 9999999.99
In [59]: pp(df)
VALE5 PETR4
2013-01-01 9999999.990000 0.170073
2013-01-02 -0.975797 0.763136
2013-01-03 -0.913254 1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005 1.448442
2013-01-06 -0.323640 0.024857
In [60]: pp(df_concat)
A B
VALE5 PETR4 AMBV3 BBDC4
2013-01-01 -0.557180 0.170073 -0.557180 0.170073
2013-01-02 -0.975797 0.763136 -0.975797 0.763136
2013-01-03 -0.913254 1.042521 -0.913254 1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005 1.448442 -1.259005 1.448442
2013-01-06 -0.323640 0.024857 -0.323640 0.024857
I guess this means concat() created a copy of the data. Is there a way to avoid a copy being made? (I want to minimize memory usage).
Also, is there a fast way to check if two DataFrames are linked to the same underlying data? (short of going through the trouble of changing the data and checking if each DataFrame has changed)
Thanks for the help.
FS
You can't (at least easily). When you call concat
, ultimately np.concatenate
gets called.
See this answer explaining why you can't concatenate arrays without copying. The short of it is that the arrays are not guaranteed to be contiguous in memory.
Here's a simple example
a = rand(2, 10)
x, y = a
z = vstack((x, y))
print 'x.base is a and y.base is a ==', x.base is a and y.base is a
print 'x.base is z or y.base is z ==', x.base is z or y.base is z
Output:
x.base is a and y.base is a == True
x.base is z or y.base is z == False
Even though x
and y
share the same base
, namely a
, concatenate
(and thus vstack
) cannot assume that they do since one often wants to concatenate arbitrarily strided arrays.
You easily generate two arrays with different strides sharing the same memory like so:
a = arange(10)
b = a[::2]
print a.strides
print b.strides
Output:
(8,)
(16,)
This is why the following happens:
In [214]: a = arange(10)
In [215]: a[::2].view(int16)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-215-0366fadb1128> in <module>()
----> 1 a[::2].view(int16)
ValueError: new type not compatible with array.
In [216]: a[::2].copy().view(int16)
Out[216]: array([0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 6, 0, 0, 0, 8, 0, 0, 0], dtype=int16)
EDIT: Using pd.merge(df1, df2, copy=False)
(or df1.merge(df2, copy=False)
) when df1.dtype != df2.dtype
will not make a copy. Otherwise, a copy is made.
这篇关于如何在不复制数据的情况下串联 pandas DataFrame?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!