Efficiently joining two dataframes based on multiple levels of a multiindex


Question

I frequently have a DataFrame with a large multiindex, and a secondary DataFrame whose multiindex uses a subset of the larger one's levels. The secondary DataFrame is usually some kind of lookup table, and I often want to add its columns to the larger DataFrame. The primary DataFrame is often very large, so I want to do this efficiently.

Here is an imaginary example, where I want to join df2 to df1:

In [11]: arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
   ....:           ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
   ....:           ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]

In [12]: tuples = list(zip(*arrays))

In [13]: index = pd.MultiIndex.from_tuples(tuples, names=['Body', 'Season','Item'])

In [14]: df1 = pd.DataFrame(np.random.randn(8,2), index=index,columns=['A','B'])

In [15]: df1
Out[15]:
                          A         B
Body Season Item
sun  summer one   -0.121588  0.272774
     winter one    0.233562 -2.005623
            two   -1.034642  0.315065
moon summer one    0.184548  0.820873
            two    0.838290  0.495047
            three  0.450813 -2.040089
     winter one   -1.149993 -0.498148
            two    2.406824 -2.031849

[8 rows x 2 columns]


In [16]: index2= pd.MultiIndex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],names=['Body','Season'])

In [17]: df2 = pd.DataFrame(['Good','Bad','Ugly','Confused'],index=index2,columns = ['Mood'])

In [18]: df2
Out[18]:
                 Mood
Body Season
sun  summer      Good
     winter       Bad
moon summer      Ugly
     winter  Confused

[4 rows x 1 columns]

Now, suppose I want to add the columns from df2 to df1. This line is the only way I could find to do the job:

In [19]: df1 = df1.reset_index().join(df2,on=['Body','Season']).set_index(df1.index.names)

In [20]: df1
Out[20]:
                          A         B      Mood
Body Season Item
sun  summer one   -0.121588  0.272774      Good
     winter one    0.233562 -2.005623       Bad
            two   -1.034642  0.315065       Bad
moon summer one    0.184548  0.820873      Ugly
            two    0.838290  0.495047      Ugly
            three  0.450813 -2.040089      Ugly
     winter one   -1.149993 -0.498148  Confused
            two    2.406824 -2.031849  Confused

[8 rows x 3 columns]

It works, but there are two problems with this method. First, the line is ugly: having to reset the index and then recreate the multiindex makes a simple operation seem needlessly complicated. Second, if I understand correctly, each call to reset_index() and set_index() creates a copy of the DataFrame. I often work with very large DataFrames, so this seems very inefficient.

Is there a better way?

Recommended answer

This is not implemented internally at the moment, but your solution is the recommended one; see here, as well as the related issue.

You can simply wrap this in a function if you want to make it look nicer. By default, reset_index and set_index each return a copy. You can pass inplace=True to either one if you prefer; that is truly in place, because these calls only change the index attribute.
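The difference between the default and inplace=True can be checked directly. The following is a minimal sketch on a small made-up frame (the data is illustrative, not taken from the question); it only shows which object is returned, not how much memory each path uses.

```python
import pandas as pd

# Small illustrative frame with a two-level index (made-up data).
df = pd.DataFrame(
    {"A": [1, 2]},
    index=pd.MultiIndex.from_tuples(
        [("sun", "summer"), ("moon", "winter")], names=["Body", "Season"]
    ),
)

# Default behaviour: a new DataFrame is returned and df keeps its index.
flat = df.reset_index()
default_returns_new_object = flat is not df
original_index_names = list(df.index.names)

# inplace=True: df itself is modified and the call returns None.
ret = df.reset_index(inplace=True)
columns_after_inplace = list(df.columns)
```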

You could patch in a nice function like:

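def merge_multi(self, df, on):
    # Flatten the index, join on the shared levels, then restore the original index.
    return self.reset_index().join(df, on=on).set_index(self.index.names)

# Attach the helper to DataFrame as a method (monkey-patch).
pd.DataFrame.merge_multi = merge_multi

df1.merge_multi(df2, on=['Body', 'Season'])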
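For readers who would rather not monkey-patch DataFrame, the same idea can be written as a plain function. This is a self-contained sketch: the small frames below are cut-down stand-ins for df1 and df2, and the parameter names left and right are my own.

```python
import numpy as np
import pandas as pd


def merge_multi(left, right, on):
    """Join `right` onto `left` using a subset of `left`'s index levels."""
    # Move left's index levels into columns, join those columns against
    # right's index, then restore left's original multiindex.
    return left.reset_index().join(right, on=on).set_index(left.index.names)


# Cut-down stand-in for df1: three index levels, two data columns.
index = pd.MultiIndex.from_tuples(
    [("sun", "summer", "one"), ("sun", "winter", "one"), ("moon", "summer", "two")],
    names=["Body", "Season", "Item"],
)
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=index, columns=["A", "B"])

# Cut-down stand-in for df2: a lookup table keyed on two of those levels.
index2 = pd.MultiIndex.from_tuples(
    [("sun", "summer"), ("sun", "winter"), ("moon", "summer")],
    names=["Body", "Season"],
)
df2 = pd.DataFrame({"Mood": ["Good", "Bad", "Ugly"]}, index=index2)

result = merge_multi(df1, df2, on=["Body", "Season"])
```

Because join defaults to a left join, the result has the same rows and the same order as the left frame, with the lookup columns appended.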

However, merging by definition creates new data, so I am not sure how much this will actually save you.
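If only one or two lookup columns are needed, a different technique from the one in this answer avoids rebuilding the large frame: drop the extra level from the large index, look those keys up in the lookup table, and assign the values by position. This is a sketch under two assumptions, that the lookup table's index is unique and that mutating df1 in place is acceptable. The frames are again made-up stand-ins.

```python
import numpy as np
import pandas as pd

# Stand-in for the large frame (made-up data).
index = pd.MultiIndex.from_tuples(
    [("sun", "summer", "one"), ("sun", "winter", "one"),
     ("sun", "winter", "two"), ("moon", "summer", "one")],
    names=["Body", "Season", "Item"],
)
df1 = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

# Stand-in for the lookup table; its index must be unique for reindex to work.
index2 = pd.MultiIndex.from_tuples(
    [("sun", "summer"), ("sun", "winter"), ("moon", "summer")],
    names=["Body", "Season"],
)
df2 = pd.DataFrame({"Mood": ["Good", "Bad", "Ugly"]}, index=index2)

# Keys shared with the lookup table: df1's index without the 'Item' level.
keys = df1.index.droplevel("Item")

# Look up one value per row of df1 and assign by position, so df1's own
# index is never reset or rebuilt.
df1["Mood"] = df2["Mood"].reindex(keys).to_numpy()
```

Keys missing from the lookup table come back as NaN, which matches what the left join produces.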

A better method is to build up smaller frames, then do a larger merge. You also might want to do something like this
