比较两个数据帧并过滤匹配的值 [英] Compare two dataframes and filter the matched values

查看:180
本文介绍了比较两个数据帧并过滤匹配的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

上一个问题:熊猫比较两个数据框,并确定匹配值



我有两个数据框:

  print (a)

ID值
0 AA12 101 BB101 CC01 DD06 1
1 AA12 101 BB101 CC01 DD06 2
2 AA11 102 BB101 CC01 2341 DD07 2
3 AA10 202 BB101 CC01 3451 DD09 3
4 AA13 103 BB101 CC02 1231 4
5 AA14 203 BB101 CC02 4561 5
打印(b)

ID值
0 AA12 101 BB101 CC01 1351 DD06 1
1 AA12 101 BB101 CC01 1351 DD06 2
2 AA11 102 BB101 CC01 DD07 2
3 AA10 202 BB101 CC01 3451 DD09 3
4 AA13 103 BB101 CC02 4
5 AA14 203 BB101 CC02 4561 6

输出

  ID值匹配?价值匹配? 
0 AA12 101 BB101 CC01 DD06 1是是
1 AA12 101 BB101 CC01 DD06 2是是
2 AA11 102 BB101 CC01 2341 DD07 2是是
3 AA10 202 BB101 CC01 3451 DD09 3是是
4 AA13 103 BB101 CC02 1231 4否是
5 AA14 203 BB101 CC02 4561 5是否

以下是@MaxU从上一篇文章中写的代码:

  pd.merge .assign(x = a.ID.str.split()。apply(sorted).str.join('')),
b.assign(x = b.ID.str.split()。apply (排序).str.join('')),
on = ['x','Value'],
how ='outer',
indicator = True)

我想要实现




  • 如果任何数据帧不包含['ID']
    (即2341,3451)的四位数项目,我想要e

  • 如果相同的ID出现不止一次,则可以在
    ['Value']上具有不同的值。



此代码的结果是 here 。不幸的是,它没有达到预期的结果。只有索引3匹配。我正在调整代码,但无法弄清楚下一步。



非常感谢您的时间和考虑!

解决方案

尝试这样:



首先让我们在两个DF中拆分和堆栈ID列:

 在[248]中:d1 = df1.set_index('Value')。ID.str.split(expand = True).stack()。to_frame ('ID')。reset_index()。rename(columns = {'level_1':'idx'})
...:d2 = df2.set_index('Value')。ID.str.split = true).stack()。to_frame('ID')。reset_index()。rename(columns = {'level_1':'idx'})

在[249]中:d1
出[249]:
值idx ID
0 1 0 AA12
1 1 1 101
2 1 2 BB101
3 1 3 CC01
4 1 4 DD06
5 2 0 AA12
6 2 1 101
7 2 2 BB101
8 2 3 CC01
9 2 4 DD06
10 2 0 AA11
11 2 1 102
12 2 2 BB101
13 2 3 CC01
14 2 4 2341
15 2 5 DD07
16 3 0 AA 10
17 3 1 202
18 3 2 BB101
19 3 3 CC01
20 3 4 3451
21 3 5 DD09
22 4 0 AA13
23 4 1 103
24 4 2 BB101
25 4 3 CC02
26 4 4 1231
27 5 0 AA14
28 5 1 203
29 5 2 BB101
30 5 3 CC02
31 5 4 4561

在[250]中:d2
输出[250]:
值idx ID
0 1 0 AA12
1 1 1 101
2 1 2 BB101
3 1 3 CC01
4 1 4 1351
5 1 5 DD06
6 2 0 AA12
7 2 1 101
8 2 2 BB101
9 2 3 CC01
10 2 4 1351
11 2 5 DD06
12 2 0 AA11
13 2 1 102
14 2 2 BB1 01
15 2 3 CC01
16 2 4 DD07
17 3 0 AA10
18 3 1 202
19 3 2 BB101
20 3 3 CC01
21 3 4 3451
22 3 5 DD09
23 4 0 AA13
24 4 1 103
25 4 2 BB101
26 4 3 CC02
27 6 0 AA14
28 6 1 203
29 6 2 BB101
30 6 3 CC02
31 6 4 4561
pre>

现在我们可以搜索'不匹配' ID:

 在[251]中:no_match_idx = d1.loc [〜d1.ID.isin(d2.ID),'idx']。unique()

在[252]中:no_match_idx
Out [252]:array([4],dtype = int64)

在[253]中:df1 ['Matched_ID'] =〜df1 .index.isin(no_match_idx)
...:df1 ['Matched_Value'] = df1.Value.isin(df2.Value)

结果:

 在[254]中:df1 
输出[254 ]:
ID值Matched_ID Matched_Value
0 AA12 101 BB101 CC01 DD06 1 True True
1 AA12 101 BB101 CC01 DD06 2 True True
2 AA11 102 BB101 CC01 2341 DD07 2 True True
3 AA10 202 BB101 CC01 3451 DD09 3 True True
4 AA13 103 BB101 CC02 1231 4 False True
5 AA14 203 BB101 CC02 4561 5 True False


Previous question: Pandas Compare two dataframes and determine the matched values

I have two dataframes:

print(a)

    ID                              Value
0   AA12 101 BB101 CC01 DD06        1
1   AA12 101 BB101 CC01 DD06        2
2   AA11 102 BB101 CC01 2341 DD07   2
3   AA10 202 BB101 CC01 3451 DD09   3
4   AA13 103 BB101 CC02 1231        4
5   AA14 203 BB101 CC02 4561        5
print(b)

    ID                              Value
0   AA12 101 BB101 CC01 1351 DD06   1
1   AA12 101 BB101 CC01 1351 DD06   2
2   AA11 102 BB101 CC01 DD07        2
3   AA10 202 BB101 CC01 3451 DD09   3
4   AA13 103 BB101 CC02             4
5   AA14 203 BB101 CC02 4561        6 

Desired output:

     ID                              Value   ID Matched?   Value Matched?
0    AA12 101 BB101 CC01 DD06        1       Yes           Yes 
1    AA12 101 BB101 CC01 DD06        2       Yes           Yes 
2    AA11 102 BB101 CC01 2341 DD07   2       Yes           Yes
3    AA10 202 BB101 CC01 3451 DD09   3       Yes           Yes
4    AA13 103 BB101 CC02 1231        4       No            Yes
5    AA14 203 BB101 CC02 4561        5       Yes           No

Here's the code written by @MaxU from the previous post:

pd.merge(a.assign(x=a.ID.str.split().apply(sorted).str.join(' ')),
             b.assign(x=b.ID.str.split().apply(sorted).str.join(' ')),
             on=['x','Value'],
             how='outer',
             indicator=True)

What I want to achieve:

  • If either dataframes do not contain four-digit item from ['ID'] (i.e. 2341, 3451), I want to exclude it from the matching process.
  • If Same ID appears more than once, they can have different values on ['Value'].

The result of this code is here. Unfortunately, it doesn't achieve the desired result. Only Index 3 gets matched. I was tweaking the code but couldn't figure out the next step.

Thanks you so much for your time and consideration!

解决方案

Try this:

first let's split and stack ID column in both DFs:

In [248]: d1 = df1.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})
     ...: d2 = df2.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})

In [249]: d1
Out[249]:
    Value  idx     ID
0       1    0   AA12
1       1    1    101
2       1    2  BB101
3       1    3   CC01
4       1    4   DD06
5       2    0   AA12
6       2    1    101
7       2    2  BB101
8       2    3   CC01
9       2    4   DD06
10      2    0   AA11
11      2    1    102
12      2    2  BB101
13      2    3   CC01
14      2    4   2341
15      2    5   DD07
16      3    0   AA10
17      3    1    202
18      3    2  BB101
19      3    3   CC01
20      3    4   3451
21      3    5   DD09
22      4    0   AA13
23      4    1    103
24      4    2  BB101
25      4    3   CC02
26      4    4   1231
27      5    0   AA14
28      5    1    203
29      5    2  BB101
30      5    3   CC02
31      5    4   4561

In [250]: d2
Out[250]:
    Value  idx     ID
0       1    0   AA12
1       1    1    101
2       1    2  BB101
3       1    3   CC01
4       1    4   1351
5       1    5   DD06
6       2    0   AA12
7       2    1    101
8       2    2  BB101
9       2    3   CC01
10      2    4   1351
11      2    5   DD06
12      2    0   AA11
13      2    1    102
14      2    2  BB101
15      2    3   CC01
16      2    4   DD07
17      3    0   AA10
18      3    1    202
19      3    2  BB101
20      3    3   CC01
21      3    4   3451
22      3    5   DD09
23      4    0   AA13
24      4    1    103
25      4    2  BB101
26      4    3   CC02
27      6    0   AA14
28      6    1    203
29      6    2  BB101
30      6    3   CC02
31      6    4   4561

now we can search for 'not matched' IDs:

In [251]: no_match_idx = d1.loc[~d1.ID.isin(d2.ID), 'idx'].unique()

In [252]: no_match_idx
Out[252]: array([4], dtype=int64)

In [253]: df1['Matched_ID'] = ~df1.index.isin(no_match_idx)
     ...: df1['Matched_Value'] = df1.Value.isin(df2.Value)

Result:

In [254]: df1
Out[254]:
                              ID  Value Matched_ID Matched_Value
0       AA12 101 BB101 CC01 DD06      1       True          True
1       AA12 101 BB101 CC01 DD06      2       True          True
2  AA11 102 BB101 CC01 2341 DD07      2       True          True
3  AA10 202 BB101 CC01 3451 DD09      3       True          True
4       AA13 103 BB101 CC02 1231      4      False          True
5       AA14 203 BB101 CC02 4561      5       True         False

这篇关于比较两个数据帧并过滤匹配的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆