Fastest way to compare rows of two pandas dataframes?
Question
So I have two pandas dataframes, A and B.
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
B is 1024 rows x 10 columns and contains every combination of 0's and 1's across those 10 columns, hence the 1024 rows.
I am trying to find which rows in A, at a particular 10 columns of A, correspond with a given row in B. I need the whole row to match up, rather than element by element.
For example, I would like
A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)]==(1,0,1,0,1,0,0,1,0,0)).all(axis=1)]
to return rows (3,5,8,11,15) of A, because those rows match the (1,0,1,0,1,0,0,1,0,0) row of B at those particular columns (1,2,3,4,5,6,7,8,9,10).
And I want to do this over every row in B. The best way I could figure out to do this was:
import numpy as np
# iterate over the rows of B; iterating the frame directly yields column labels
for _, row in B.iterrows():
    B_array = np.array(row)
    Matching_Rows = A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)] == B_array).all(axis=1)]
    Matching_Rows_Index = Matching_Rows.index
This isn't terrible for one instance, but I use it in a while loop that runs around 20,000 times; therefore, it slows it down quite a bit.
I have been messing around with DataFrame.apply to no avail. Could map work better?
I was just hoping someone saw something obviously more efficient as I am fairly new to python.
Thanks and regards!
Recommended answer
You can use merge with reset_index. The output pairs each index of B with the indexes of the rows in A that are equal to it in the chosen columns:
import pandas as pd

A = pd.DataFrame({'A':[1,0,1,1],
                  'B':[0,0,1,1],
                  'C':[1,0,1,1],
                  'D':[1,1,1,0],
                  'E':[1,1,0,1]})
print (A)
   A  B  C  D  E
0  1  0  1  1  1
1  0  0  0  1  1
2  1  1  1  1  0
3  1  1  1  0  1
B = pd.DataFrame({'0':[1,0,1],
                  '1':[1,0,1],
                  '2':[1,0,0]})
print (B)
   0  1  2
0  1  1  1
1  0  0  0
2  1  1  0
print (pd.merge(B.reset_index(),
                A.reset_index(),
                left_on=B.columns.tolist(),
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A')))
   index_B  0  1  2  index_A  A  B  C  D  E
0        0  1  1  1        2  1  1  1  1  0
1        0  1  1  1        3  1  1  1  0  1
2        1  0  0  0        1  0  0  0  1  1
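Note that the default inner merge keeps only the rows of B that have a match in A, so row 2 of B (1, 1, 0) does not appear in this output. If unmatched rows should be kept as well, one option (a sketch, not part of the original answer) is to pass how='left' together with indicator=True. The sample frames are rebuilt here so the snippet runs on its own; the names `left` and `unmatched_B` are only illustrative.

```python
import pandas as pd

A = pd.DataFrame({'A': [1, 0, 1, 1],
                  'B': [0, 0, 1, 1],
                  'C': [1, 0, 1, 1],
                  'D': [1, 1, 1, 0],
                  'E': [1, 1, 0, 1]})
B = pd.DataFrame({'0': [1, 0, 1],
                  '1': [1, 0, 1],
                  '2': [1, 0, 0]})

# Left merge keeps every row of B; indicator=True adds a '_merge' column
left = pd.merge(B.reset_index(),
                A.reset_index(),
                how='left',
                left_on=B.columns.tolist(),
                right_on=A.columns[[0, 1, 2]].tolist(),
                suffixes=('_B', '_A'),
                indicator=True)

# Rows of B that matched nothing in A are flagged 'left_only'
unmatched_B = left.loc[left['_merge'] == 'left_only', 'index_B'].tolist()
print(unmatched_B)
```

For the sample data this flags only row 2 of B; its index_A entry is NaN in `left`.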
print (pd.merge(B.reset_index(),
                A.reset_index(),
                left_on=B.columns.tolist(),
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A'))[['index_B','index_A']])
   index_B  index_A
0        0        2
1        0        3
2        1        1
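To get, for every row of B, the list of matching row labels in A (what the loop in the question collects one row at a time), the two-column result can be grouped by index_B. Below is a minimal sketch on data of the shape described in the question. The random data, the choice of positional columns 1 to 10, the `b0`..`b9` column names, and the names `A_keys`, `pairs` and `matches` are all assumptions made for illustration.

```python
import itertools

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# A: 1000 x 500 binary frame (random stand-in for the real data)
A = pd.DataFrame(rng.integers(0, 2, size=(1000, 500)))
# B: all 2**10 = 1024 combinations of 0 and 1
B = pd.DataFrame(list(itertools.product([0, 1], repeat=10)),
                 columns=[f'b{i}' for i in range(10)])

# Assumed: the 10 columns of A to compare are positions 1..10
A_keys = A.iloc[:, 1:11].copy()
A_keys.columns = B.columns  # give both frames the same key names

# One merge replaces the Python-level loop over the rows of B
pairs = pd.merge(B.reset_index(), A_keys.reset_index(),
                 on=B.columns.tolist(), suffixes=('_B', '_A'))

# Series indexed by B row label; each value lists the matching A row labels
matches = pairs.groupby('index_B')['index_A'].apply(list)
print(matches.head())
```

Because B contains every possible 10-bit pattern, each row of A lands in exactly one group; B rows with no match in A simply do not appear in `matches`.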