Pandas join vs add column
Question
I have 2 dataframes (df1 and df2) with the same MultiIndex. df1 has column A, df2 has column B.
I found 2 ways of 'joining' these dataframes:
df_joined = df1.join(df2, how='inner')
or
df1['B'] = df2['B']
The first option takes much longer. Why? Does option 2 ignore the indexes and just 'attach' the column on the right?
Running this afterwards returns True, so the end result seems to be the same, but perhaps that is only because the indexes in df1 and df2 are also in the same order:
df_joined.equals(df1)
Is there any faster way to join the dataframes, knowing the indexes are the same?
Recommended answer
There is no faster way than df1['B'] = df2['B'] if the indices are aligned.
Assigning a Series to a DataFrame column is already well optimised in pandas.
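To address the question of whether the assignment ignores the index: as far as I know it does not. Column assignment matches the right-hand Series to df1's index by label, and it is cheap when the two indexes are already equal. Here is a minimal sketch with made-up data (idx, df2_reversed and aligned are illustrative names, not from the question):

```python
import pandas as pd

# Hypothetical data: two frames sharing the same MultiIndex labels.
idx = pd.MultiIndex.from_product([["x", "y"], [1, 2]], names=["key", "num"])
df1 = pd.DataFrame({"A": [10, 20, 30, 40]}, index=idx)
df2 = pd.DataFrame({"B": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Same rows as df2, but in reversed order.
df2_reversed = df2.iloc[::-1]

aligned = df1.copy()
# The Series is matched to df1 by index label, not by position.
aligned["B"] = df2_reversed["B"]

# The values come back in df1's label order despite the reversed input.
print(aligned["B"].tolist())
```

So the result is correct even when the row order differs; the order only affects how much work pandas has to do.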
join takes longer than assignment because it explicitly lines up df1.index and df2.index, which is expensive. It does not assume that the indices are in a consistent order. As per the pd.DataFrame.join documentation, if no column is specified, the join takes place on the dataframes' respective indices.
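The two options only give the same result when the indexes really do contain the same labels. A small sketch of where they diverge, using hypothetical frames whose indexes only partly overlap (idx1, idx2, joined and assigned are illustrative names):

```python
import pandas as pd

# Hypothetical frames whose MultiIndexes share only two labels.
idx1 = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1)], names=["key", "num"]
)
idx2 = pd.MultiIndex.from_tuples(
    [("x", 1), ("y", 1), ("z", 9)], names=["key", "num"]
)
df1 = pd.DataFrame({"A": [10, 20, 30]}, index=idx1)
df2 = pd.DataFrame({"B": [1.0, 2.0, 3.0]}, index=idx2)

# Inner join keeps only the labels present in both indexes.
joined = df1.join(df2, how="inner")

# Assignment keeps every row of df1 and fills NaN where df2 has no match.
assigned = df1.copy()
assigned["B"] = df2["B"]

print(len(joined), len(assigned))
print(assigned["B"].isna().sum())
```

If you are not sure the indexes match, df1.index.equals(df2.index) is a quick way to check before relying on the assignment.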
I would be surprised if this turns out to be a bottleneck in your workflow. If it is, I suggest you drop down to numpy arrays and avoid pandas altogether.
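For illustration, dropping down to numpy might look roughly like this. It is only a sketch with made-up data, and it assumes the two indexes are identical and in the same order, because the arrays are combined purely by position:

```python
import numpy as np
import pandas as pd

# Hypothetical data with an identical MultiIndex on both frames.
idx = pd.MultiIndex.from_product([["x", "y"], [1, 2]], names=["key", "num"])
df1 = pd.DataFrame({"A": [10, 20, 30, 40]}, index=idx)
df2 = pd.DataFrame({"B": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Guard: positional stacking is only valid if labels match row for row.
assert df1.index.equals(df2.index)

# Work on raw arrays; no index alignment happens from here on.
a = df1["A"].to_numpy()
b = df2["B"].to_numpy()
combined = np.column_stack([a, b])  # one row per original row, columns A and B

print(combined.shape)
```

Note that you lose the index labels this way, so it only makes sense if the rest of the computation can work on plain arrays.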