Fastest way to compare rows of two pandas dataframes?


Question

I have two pandas dataframes, A and B.

A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.

B is 1024 rows x 10 columns and enumerates every combination of 0's and 1's across those 10 columns, hence the 1024 (2^10) rows.
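For concreteness, a B of this shape could be built by enumerating every 0/1 combination of length 10. This is only a sketch of one possible construction using itertools.product; the question does not say how B was actually built, and the default integer column labels are an assumption.

```python
import itertools

import pandas as pd

# Every 0/1 combination of length 10, one combination per row: 2**10 = 1024 rows.
# Assumption: default integer column labels 0..9.
B = pd.DataFrame(list(itertools.product([0, 1], repeat=10)))

print(B.shape)
```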

I am trying to find which rows of A, restricted to a particular 10 columns of A, correspond to a given row of B. I need the whole row to match, rather than element by element.

For example, I would like

A[(A.iloc[:, [1,2,3,4,5,6,7,8,9,10]] == (1,0,1,0,1,0,0,1,0,0)).all(axis=1)]

to return something like: rows (3,5,8,11,15) of A match the (1,0,1,0,1,0,0,1,0,0) row of B at those particular columns (1,2,3,4,5,6,7,8,9,10).

And I want to do this for every row of B. The best way I could figure out to do this was:

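import numpy as np
# Iterate over the rows of B (iterating over B directly yields column labels, not rows)
for B_array in B.to_numpy():
    # Keep the rows of A whose 10 selected columns all equal this row of B
    Matching_Rows = A[(A.iloc[:, [1,2,3,4,5,6,7,8,9,10]] == B_array).all(axis=1)]
    Matching_Rows_Index = Matching_Rows.index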

This isn't terrible for one instance, but I use it inside a while loop that runs around 20,000 times, so it slows things down quite a bit.

I have been messing around with DataFrame.apply to no avail. Could map work better?

I was just hoping someone would see something obviously more efficient, as I am fairly new to Python.

Thanks and regards!

Recommended answer

You can use merge with reset_index. The output pairs the index of each row of B with the index of every row of A that is equal to it on the chosen columns:

import pandas as pd

A = pd.DataFrame({'A':[1,0,1,1],
                  'B':[0,0,1,1],
                  'C':[1,0,1,1],
                  'D':[1,1,1,0],
                  'E':[1,1,0,1]})

print (A)
   A  B  C  D  E
0  1  0  1  1  1
1  0  0  0  1  1
2  1  1  1  1  0
3  1  1  1  0  1

B = pd.DataFrame({'0':[1,0,1],
                  '1':[1,0,1],
                  '2':[1,0,0]})

print (B)
   0  1  2
0  1  1  1
1  0  0  0
2  1  1  0

print (pd.merge(B.reset_index(), 
                A.reset_index(), 
                left_on=B.columns.tolist(), 
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A')))

   index_B  0  1  2  index_A  A  B  C  D  E
0        0  1  1  1        2  1  1  1  1  0
1        0  1  1  1        3  1  1  1  0  1
2        1  0  0  0        1  0  0  0  1  1    

print (pd.merge(B.reset_index(), 
                A.reset_index(), 
                left_on=B.columns.tolist(), 
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A'))[['index_B','index_A']])    

   index_B  index_A
0        0        2
1        0        3
2        1        1   
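
The same merge can be applied at the shapes described in the question. The following is a minimal sketch under assumptions, not part of the original answer: A and B are stand-in data, the compared columns are assumed to be positions 1 to 10 of A, and names such as cols, pairs and matches are hypothetical. It merges once and then groups the result, so each row of B maps to the index labels of its matching rows in A without a per-row loop.

```python
import itertools

import numpy as np
import pandas as pd

# Stand-in data (assumption): A is 1000 x 500 random binary values,
# B enumerates all 2**10 = 1024 combinations of ten 0/1 values.
rng = np.random.default_rng(0)
A = pd.DataFrame(rng.integers(0, 2, size=(1000, 500)))
B = pd.DataFrame(list(itertools.product([0, 1], repeat=10)),
                 columns=[f"b{i}" for i in range(10)])

# Hypothetical choice of the 10 columns of A to compare against B.
cols = list(range(1, 11))

# Copy the selected columns of A and give them B's labels so one key list works.
A_sub = A[cols].copy()
A_sub.columns = B.columns
A_sub = A_sub.rename_axis("index_A").reset_index()
B_idx = B.rename_axis("index_B").reset_index()

# One inner merge pairs each row of B with every row of A equal to it on all 10 columns.
pairs = pd.merge(B_idx, A_sub, on=B.columns.tolist())[["index_B", "index_A"]]

# Map each row of B to the list of matching row labels in A.
matches = pairs.groupby("index_B")["index_A"].apply(list)
```

Rows of B that match nothing in A are simply absent from matches, so matches.get(k, []) can be used to get an empty list for them.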
