如何在不复制数据的情况下串联 pandas DataFrame? [英] How do I concatenate pandas DataFrames without copying the data?

查看:128
本文介绍了如何在不复制数据的情况下串联 pandas DataFrame?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我要连接两个熊猫DataFrame,而不复制数据.也就是说,我希望串联的DataFrame是两个原始DataFrame中数据的视图.我尝试使用concat(),但是没有用.此代码块显示更改基础数据会影响串联的两个DataFrame,但不会影响串联的DataFrame:

arr = np.random.randn(12).reshape(6, 2)
df = pd.DataFrame(arr, columns = ('VALE5', 'PETR4'), index = dates)
arr2 = np.random.randn(12).reshape(6, 2)
df2 = pd.DataFrame(arr, columns = ('AMBV3', 'BBDC4'), index = dates)
df_concat = pd.concat(dict(A = df, B = df2),axis=1)
pp(df)
pp(df_concat)
arr[0, 0] = 9999999.99
pp(df)
pp(df_concat)

这是最后五行的输出.在将新值分配给arr [0,0]之后df改变了; df_concat不受影响.

In [56]: pp(df)
           VALE5     PETR4
2013-01-01 -0.557180  0.170073
2013-01-02 -0.975797  0.763136
2013-01-03 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442
2013-01-06 -0.323640  0.024857

In [57]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

In [58]: arr[0, 0] = 9999999.99

In [59]: pp(df)
                 VALE5     PETR4
2013-01-01  9999999.990000  0.170073
2013-01-02       -0.975797  0.763136
2013-01-03       -0.913254  1.042521
2013-01-04       -1.973013 -2.069460
2013-01-05       -1.259005  1.448442
2013-01-06       -0.323640  0.024857

In [60]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

我想这意味着concat()创建了数据的副本.有办法避免制作副本吗? (我想最大程度地减少内存使用量.)

还有,有没有一种快速的方法来检查两个DataFrame是否链接到相同的基础数据? (省去了更改数据和检查每个DataFrame是否已更改的麻烦)

感谢您的帮助.

FS

解决方案

您不能(至少很容易).当您调用concat时,最终会调用np.concatenate.

请参见此答案,说明为什么不能在不复制的情况下连接数组.简短的是,不能保证数组在内存中是连续的.

这是一个简单的示例

a = rand(2, 10)
x, y = a
z = vstack((x, y))
print 'x.base is a and y.base is a ==', x.base is a and y.base is a
print 'x.base is z or y.base is z ==', x.base is z or y.base is z

输出:

x.base is a and y.base is a == True
x.base is z or y.base is z == False

即使xy共享相同的base,即aconcatenate(因此是vstack)也不能假设它们这样做,因为人们经常想连接任意跨度的数组. /p>

您可以轻松地生成两个步长不同的两个数组,它们共享相同的内存,如下所示:

a = arange(10)
b = a[::2]
print a.strides
print b.strides

输出:

(8,)
(16,)

这就是为什么发生以下情况的原因:

In [214]: a = arange(10)

In [215]: a[::2].view(int16)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-215-0366fadb1128> in <module>()
----> 1 a[::2].view(int16)

ValueError: new type not compatible with array.

In [216]: a[::2].copy().view(int16)
Out[216]: array([0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 6, 0, 0, 0, 8, 0, 0, 0], dtype=int16)

编辑:df1.dtype != df2.dtype时使用pd.merge(df1, df2, copy=False)(或df1.merge(df2, copy=False))将不会进行复制.否则,将进行复制.

I want to concatenate two pandas DataFrames without copying the data. That is, I want the concatenated DataFrame to be a view on the data in the two original DataFrames. I tried using concat() and that did not work. This block of code shows that changing the underlying data affects the two DataFrames that are concatenated but not the concatenated DataFrame:

arr = np.random.randn(12).reshape(6, 2)
df = pd.DataFrame(arr, columns = ('VALE5', 'PETR4'), index = dates)
arr2 = np.random.randn(12).reshape(6, 2)
df2 = pd.DataFrame(arr, columns = ('AMBV3', 'BBDC4'), index = dates)
df_concat = pd.concat(dict(A = df, B = df2),axis=1)
pp(df)
pp(df_concat)
arr[0, 0] = 9999999.99
pp(df)
pp(df_concat)

This is the output of the last five lines. df changed after a new value was assigned to arr[0, 0]; df_concat was not affected.

In [56]: pp(df)
           VALE5     PETR4
2013-01-01 -0.557180  0.170073
2013-01-02 -0.975797  0.763136
2013-01-03 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442
2013-01-06 -0.323640  0.024857

In [57]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

In [58]: arr[0, 0] = 9999999.99

In [59]: pp(df)
                 VALE5     PETR4
2013-01-01  9999999.990000  0.170073
2013-01-02       -0.975797  0.763136
2013-01-03       -0.913254  1.042521
2013-01-04       -1.973013 -2.069460
2013-01-05       -1.259005  1.448442
2013-01-06       -0.323640  0.024857

In [60]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

I guess this means concat() created a copy of the data. Is there a way to avoid a copy being made? (I want to minimize memory usage).

Also, is there a fast way to check if two DataFrames are linked to the same underlying data? (short of going through the trouble of changing the data and checking if each DataFrame has changed)

Thanks for the help.

FS

解决方案

You can't (at least easily). When you call concat, ultimately np.concatenate gets called.

See this answer explaining why you can't concatenate arrays without copying. The short of it is that the arrays are not guaranteed to be contiguous in memory.

Here's a simple example

a = rand(2, 10)
x, y = a
z = vstack((x, y))
print 'x.base is a and y.base is a ==', x.base is a and y.base is a
print 'x.base is z or y.base is z ==', x.base is z or y.base is z

Output:

x.base is a and y.base is a == True
x.base is z or y.base is z == False

Even though x and y share the same base, namely a, concatenate (and thus vstack) cannot assume that they do since one often wants to concatenate arbitrarily strided arrays.

You easily generate two arrays with different strides sharing the same memory like so:

a = arange(10)
b = a[::2]
print a.strides
print b.strides

Output:

(8,)
(16,)

This is why the following happens:

In [214]: a = arange(10)

In [215]: a[::2].view(int16)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-215-0366fadb1128> in <module>()
----> 1 a[::2].view(int16)

ValueError: new type not compatible with array.

In [216]: a[::2].copy().view(int16)
Out[216]: array([0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 6, 0, 0, 0, 8, 0, 0, 0], dtype=int16)

EDIT: Using pd.merge(df1, df2, copy=False) (or df1.merge(df2, copy=False)) when df1.dtype != df2.dtype will not make a copy. Otherwise, a copy is made.

这篇关于如何在不复制数据的情况下串联 pandas DataFrame?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆