Fastest way to copy columns from one DataFrame to another using pandas?


Problem Description

I have a large DataFrame (a million+ records) that I'm using to store the core of my data (like a database), and a smaller DataFrame (1 to 2,000 records) whose columns I combine into it at each time step of my program, which can run for several thousand time steps. Both DataFrames are indexed the same way, by an id column.

The code I am using is:

df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols]

Where core_cols is a list of about 10 fields that I'm copying over and new_ids are the ids from the small DataFrame. This code works fine, but it is the slowest part of my code by an order of magnitude. I just wanted to know if there was a faster way to merge the data of the two DataFrames together.
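One micro-optimization worth trying (a sketch, not from the original thread): pass a raw NumPy array on the right-hand side of the `.loc` assignment. Since the rows were already selected by `new_ids` in the same order, handing pandas a bare array skips a second round of index/column alignment on the source frame. The frame names mirror the question; the sizes here are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical small stand-ins for the frames in the question.
df_large = pd.DataFrame(1.0, index=range(10), columns=["a", "b", "c"])
df_small = pd.DataFrame(0.0, index=[2, 5, 7], columns=["a", "b"])

new_ids = df_small.index
core_cols = ["a", "b"]

# .to_numpy() hands .loc a bare array, so pandas doesn't have to
# align df_small's index/columns against df_large a second time.
df_large.loc[new_ids, core_cols] = df_small.loc[new_ids, core_cols].to_numpy()
```

Whether this is faster in practice depends on frame sizes and dtypes, so it is worth timing against the plain `.loc`-to-`.loc` assignment on your own data.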

I tried merging the data each time with the merge function, but the process took way too long, which is why I moved to maintaining a larger DataFrame that I update in place to improve speed.
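For completeness (not suggested in the original thread): pandas also provides DataFrame.update, which aligns on both index and columns and writes in place, so it may be worth timing against the `.loc` assignment. Note that it skips NaN cells in the source frame, overwriting only where the small frame actually has data. A minimal sketch with hypothetical frames:

```python
import pandas as pd

df_large = pd.DataFrame(1.0, index=range(10), columns=["x", "y", "z"])
df_small = pd.DataFrame(0.0, index=[3, 6], columns=["x", "y"])

# update() aligns on index and columns, modifies df_large in place,
# and ignores NaN cells in df_small (they don't clobber existing data).
df_large.update(df_small)
```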

Recommended Answer

There is nothing inherently slow about using .loc to set with an alignable frame, though it does go through a bit of code to cover a lot of cases, so it's probably not ideal to have in a tight loop. FYI, this example is slightly different from the second example below.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: from pandas import DataFrame

In [4]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])

In [5]: df
Out[5]: 
   0  1  2
a  1  1  1
b  1  1  1
c  1  1  1
d  1  1  1
e  1  1  1
f  1  1  1
g  1  1  1
h  1  1  1
i  1  1  1
j  1  1  1

[10 rows x 3 columns]

In [6]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])

In [7]: df2
Out[7]: 
   1  2
a  0  0
f  0  0
g  0  0

[3 rows x 2 columns]

In [8]: df.loc[df2.index,df2.columns] = df2

In [9]: df
Out[9]: 
   0  1  2
a  1  0  0
b  1  1  1
c  1  1  1
d  1  1  1
e  1  1  1
f  1  0  0
g  1  0  0
h  1  1  1
i  1  1  1
j  1  1  1

[10 rows x 3 columns]

Here's an alternative. It may or may not fit your data pattern. If the updates (your small frame) are pretty much independent, this will work (IOW, you are not updating the big frame, then picking out a new sub-frame, then updating, etc.; if that is your pattern, then using .loc is about right).

Instead of updating the big frame, update the small frame with the columns from the big frame, e.g.:

In [10]: df = DataFrame(1.,index=list('abcdefghij'),columns=[0,1,2])

In [11]: df2 = DataFrame(0,index=list('afg'),columns=[1,2])

In [12]: needed_columns = df.columns.difference(df2.columns)

In [13]: df2[needed_columns] = df.reindex(index=df2.index,columns=needed_columns)

In [14]: df2
Out[14]: 
   1  2  0
a  0  0  1
f  0  0  1
g  0  0  1

[3 rows x 3 columns]

In [15]: df3 = DataFrame(0,index=list('cji'),columns=[1,2])

In [16]: needed_columns = df.columns.difference(df3.columns)

In [17]: df3[needed_columns] = df.reindex(index=df3.index,columns=needed_columns)

In [18]: df3
Out[18]: 
   1  2  0
c  0  0  1
j  0  0  1
i  0  0  1

[3 rows x 3 columns]

And concat everything together when you want (they are kept in a list in the meantime; or, see my comments below, these sub-frames could be moved to external storage when created, then read back before this concatenating step).

In [19]: pd.concat([ df.reindex(index=df.index.difference(df2.index).difference(df3.index)), df2, df3]).reindex_like(df)
Out[19]: 
   0  1  2
a  1  0  0
b  1  1  1
c  1  0  0
d  1  1  1
e  1  1  1
f  1  0  0
g  1  0  0
h  1  1  1
i  1  0  0
j  1  0  0

[10 rows x 3 columns]

The beauty of this pattern is that it is easily extended to using an actual db (or, much better, an HDFStore) to actually store the 'database', then creating/updating sub-frames as needed and writing out to a new store when finished.
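A rough, self-contained sketch of that store-and-concat flow, using pickle files in a temp directory as a stand-in for an actual HDFStore or db (the paths, frame sizes, and file naming are all made up for illustration):

```python
import pathlib
import tempfile

import pandas as pd

# Stand-in "database" frame and two independent update sub-frames.
df = pd.DataFrame(1.0, index=list("abcdefghij"), columns=[0, 1, 2])
subs = [pd.DataFrame(0.0, index=list("afg"), columns=[1, 2]),
        pd.DataFrame(0.0, index=list("cji"), columns=[1, 2])]

tmp = pathlib.Path(tempfile.mkdtemp())
paths = []
for i, sub in enumerate(subs):
    # Pull the missing columns from the big frame into the sub-frame,
    # then persist the sub-frame (stand-in for store.put(...)).
    cols = df.columns.difference(sub.columns)
    sub[cols] = df.reindex(index=sub.index, columns=cols)
    path = tmp / f"sub_{i}.pkl"
    sub.to_pickle(path)
    paths.append(path)

# Rows never touched by any sub-frame stay as-is in the big frame.
untouched = df.index
for sub in subs:
    untouched = untouched.difference(sub.index)

# Read the sub-frames back and concat everything into the final frame.
parts = [df.reindex(index=untouched)] + [pd.read_pickle(p) for p in paths]
result = pd.concat(parts).reindex_like(df)
```

With an HDFStore the to_pickle/read_pickle pair would become store.put and store.select, but the shape of the flow is the same.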

I use this pattern all of the time, though with Panels actually.

  • Perform the computation on a subset of the data, writing each subset to a separate file.
  • Then, at the end, read them all in, concat, and write out one giant new file. The concat step can be done in memory all at once, or iteratively if it is truly a large task.

I am able to use multiple processes to perform my computations AND write each individual Panel to a separate file, as they are all completely independent. The only dependent part is the concat.

This is essentially a map-reduce pattern.
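That map-reduce flow can be sketched in miniature (threads instead of separate processes, and a plain DataFrame instead of the now-removed Panels, just to keep the example self-contained; the chunking and the `work` computation are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def work(chunk: pd.DataFrame) -> pd.DataFrame:
    # Map step: an independent computation on one subset of the data.
    out = chunk.copy()
    out["total"] = out.sum(axis=1)
    return out

df = pd.DataFrame(1.0, index=range(9), columns=["a", "b"])

# Split the data into independent subsets.
chunks = [df.iloc[i:i + 3] for i in range(0, len(df), 3)]

# Run the map step in parallel; each subset is completely independent.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(work, chunks))

# Reduce step: the only dependent part is the final concat.
combined = pd.concat(results)
```

In the answer's real setup each worker was a separate process writing its result to its own file, with the read-back happening just before the concat.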

