比较两个数据帧并过滤匹配的值 [英] Compare two dataframes and filter the matched values
问题描述
上一个问题:熊猫比较两个数据框,并确定匹配值
我有两个数据框:
print (a)
ID值
0 AA12 101 BB101 CC01 DD06 1
1 AA12 101 BB101 CC01 DD06 2
2 AA11 102 BB101 CC01 2341 DD07 2
3 AA10 202 BB101 CC01 3451 DD09 3
4 AA13 103 BB101 CC02 1231 4
5 AA14 203 BB101 CC02 4561 5
打印(b)
ID值
0 AA12 101 BB101 CC01 1351 DD06 1
1 AA12 101 BB101 CC01 1351 DD06 2
2 AA11 102 BB101 CC01 DD07 2
3 AA10 202 BB101 CC01 3451 DD09 3
4 AA13 103 BB101 CC02 4
5 AA14 203 BB101 CC02 4561 6
输出:
ID值匹配?价值匹配?
0 AA12 101 BB101 CC01 DD06 1是是
1 AA12 101 BB101 CC01 DD06 2是是
2 AA11 102 BB101 CC01 2341 DD07 2是是
3 AA10 202 BB101 CC01 3451 DD09 3是是
4 AA13 103 BB101 CC02 1231 4否是
5 AA14 203 BB101 CC02 4561 5是否
以下是@MaxU从上一篇文章中写的代码:
pd.merge .assign(x = a.ID.str.split()。apply(sorted).str.join('')),
b.assign(x = b.ID.str.split()。apply (排序).str.join('')),
on = ['x','Value'],
how ='outer',
indicator = True)
我想要实现:
- 如果任何数据帧不包含['ID']
(即2341,3451)的四位数项目,我想要e - 如果相同的ID出现不止一次,则可以在
['Value']上具有不同的值。
此代码的结果是 here 。不幸的是,它没有达到预期的结果。只有索引3匹配。我正在调整代码,但无法弄清楚下一步。
非常感谢您的时间和考虑!
尝试这样:
首先让我们在两个DF中拆分和堆栈ID列:
在[248]中:d1 = df1.set_index('Value')。ID.str.split(expand = True).stack()。to_frame ('ID')。reset_index()。rename(columns = {'level_1':'idx'})
pre>
...:d2 = df2.set_index('Value')。ID.str.split = true).stack()。to_frame('ID')。reset_index()。rename(columns = {'level_1':'idx'})
在[249]中:d1
出[249]:
值idx ID
0 1 0 AA12
1 1 1 101
2 1 2 BB101
3 1 3 CC01
4 1 4 DD06
5 2 0 AA12
6 2 1 101
7 2 2 BB101
8 2 3 CC01
9 2 4 DD06
10 2 0 AA11
11 2 1 102
12 2 2 BB101
13 2 3 CC01
14 2 4 2341
15 2 5 DD07
16 3 0 AA 10
17 3 1 202
18 3 2 BB101
19 3 3 CC01
20 3 4 3451
21 3 5 DD09
22 4 0 AA13
23 4 1 103
24 4 2 BB101
25 4 3 CC02
26 4 4 1231
27 5 0 AA14
28 5 1 203
29 5 2 BB101
30 5 3 CC02
31 5 4 4561
在[250]中:d2
输出[250]:
值idx ID
0 1 0 AA12
1 1 1 101
2 1 2 BB101
3 1 3 CC01
4 1 4 1351
5 1 5 DD06
6 2 0 AA12
7 2 1 101
8 2 2 BB101
9 2 3 CC01
10 2 4 1351
11 2 5 DD06
12 2 0 AA11
13 2 1 102
14 2 2 BB1 01
15 2 3 CC01
16 2 4 DD07
17 3 0 AA10
18 3 1 202
19 3 2 BB101
20 3 3 CC01
21 3 4 3451
22 3 5 DD09
23 4 0 AA13
24 4 1 103
25 4 2 BB101
26 4 3 CC02
27 6 0 AA14
28 6 1 203
29 6 2 BB101
30 6 3 CC02
31 6 4 4561
现在我们可以搜索
'不匹配'
ID:在[251]中:no_match_idx = d1.loc [〜d1.ID.isin(d2.ID),'idx']。unique()
在[252]中:no_match_idx
Out [252]:array([4],dtype = int64)
在[253]中:df1 ['Matched_ID'] =〜df1 .index.isin(no_match_idx)
...:df1 ['Matched_Value'] = df1.Value.isin(df2.Value)
结果:
在[254]中:df1
输出[254 ]:
ID值Matched_ID Matched_Value
0 AA12 101 BB101 CC01 DD06 1 True True
1 AA12 101 BB101 CC01 DD06 2 True True
2 AA11 102 BB101 CC01 2341 DD07 2 True True
3 AA10 202 BB101 CC01 3451 DD09 3 True True
4 AA13 103 BB101 CC02 1231 4 False True
5 AA14 203 BB101 CC02 4561 5 True False
Previous question: Pandas Compare two dataframes and determine the matched values
I have two dataframes:
print(a) ID Value 0 AA12 101 BB101 CC01 DD06 1 1 AA12 101 BB101 CC01 DD06 2 2 AA11 102 BB101 CC01 2341 DD07 2 3 AA10 202 BB101 CC01 3451 DD09 3 4 AA13 103 BB101 CC02 1231 4 5 AA14 203 BB101 CC02 4561 5 print(b) ID Value 0 AA12 101 BB101 CC01 1351 DD06 1 1 AA12 101 BB101 CC01 1351 DD06 2 2 AA11 102 BB101 CC01 DD07 2 3 AA10 202 BB101 CC01 3451 DD09 3 4 AA13 103 BB101 CC02 4 5 AA14 203 BB101 CC02 4561 6
Desired output:
ID Value ID Matched? Value Matched? 0 AA12 101 BB101 CC01 DD06 1 Yes Yes 1 AA12 101 BB101 CC01 DD06 2 Yes Yes 2 AA11 102 BB101 CC01 2341 DD07 2 Yes Yes 3 AA10 202 BB101 CC01 3451 DD09 3 Yes Yes 4 AA13 103 BB101 CC02 1231 4 No Yes 5 AA14 203 BB101 CC02 4561 5 Yes No
Here's the code written by @MaxU from the previous post:
pd.merge(a.assign(x=a.ID.str.split().apply(sorted).str.join(' ')), b.assign(x=b.ID.str.split().apply(sorted).str.join(' ')), on=['x','Value'], how='outer', indicator=True)
What I want to achieve:
- If either dataframes do not contain four-digit item from ['ID'] (i.e. 2341, 3451), I want to exclude it from the matching process.
- If Same ID appears more than once, they can have different values on ['Value'].
The result of this code is here. Unfortunately, it doesn't achieve the desired result. Only Index 3 gets matched. I was tweaking the code but couldn't figure out the next step.
Thanks you so much for your time and consideration!
Try this:
first let's split and stack ID column in both DFs:
In [248]: d1 = df1.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})
...: d2 = df2.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})
In [249]: d1
Out[249]:
Value idx ID
0 1 0 AA12
1 1 1 101
2 1 2 BB101
3 1 3 CC01
4 1 4 DD06
5 2 0 AA12
6 2 1 101
7 2 2 BB101
8 2 3 CC01
9 2 4 DD06
10 2 0 AA11
11 2 1 102
12 2 2 BB101
13 2 3 CC01
14 2 4 2341
15 2 5 DD07
16 3 0 AA10
17 3 1 202
18 3 2 BB101
19 3 3 CC01
20 3 4 3451
21 3 5 DD09
22 4 0 AA13
23 4 1 103
24 4 2 BB101
25 4 3 CC02
26 4 4 1231
27 5 0 AA14
28 5 1 203
29 5 2 BB101
30 5 3 CC02
31 5 4 4561
In [250]: d2
Out[250]:
Value idx ID
0 1 0 AA12
1 1 1 101
2 1 2 BB101
3 1 3 CC01
4 1 4 1351
5 1 5 DD06
6 2 0 AA12
7 2 1 101
8 2 2 BB101
9 2 3 CC01
10 2 4 1351
11 2 5 DD06
12 2 0 AA11
13 2 1 102
14 2 2 BB101
15 2 3 CC01
16 2 4 DD07
17 3 0 AA10
18 3 1 202
19 3 2 BB101
20 3 3 CC01
21 3 4 3451
22 3 5 DD09
23 4 0 AA13
24 4 1 103
25 4 2 BB101
26 4 3 CC02
27 6 0 AA14
28 6 1 203
29 6 2 BB101
30 6 3 CC02
31 6 4 4561
now we can search for 'not matched'
IDs:
In [251]: no_match_idx = d1.loc[~d1.ID.isin(d2.ID), 'idx'].unique()
In [252]: no_match_idx
Out[252]: array([4], dtype=int64)
In [253]: df1['Matched_ID'] = ~df1.index.isin(no_match_idx)
...: df1['Matched_Value'] = df1.Value.isin(df2.Value)
Result:
In [254]: df1
Out[254]:
ID Value Matched_ID Matched_Value
0 AA12 101 BB101 CC01 DD06 1 True True
1 AA12 101 BB101 CC01 DD06 2 True True
2 AA11 102 BB101 CC01 2341 DD07 2 True True
3 AA10 202 BB101 CC01 3451 DD09 3 True True
4 AA13 103 BB101 CC02 1231 4 False True
5 AA14 203 BB101 CC02 4561 5 True False
这篇关于比较两个数据帧并过滤匹配的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!