比较两个数据帧并过滤匹配的值 [英] Compare two dataframes and filter the matched values

查看：180 发布时间：2017/3/26 2:19:32 python pandas dataframe

本文介绍了比较两个数据帧并过滤匹配的值的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有两个数据框：

  print （a）
 
 ID值
 0 AA12 101 BB101 CC01 DD06 1 
 1 AA12 101 BB101 CC01 DD06 2 
 2 AA11 102 BB101 CC01 2341 DD07 2 
 3 AA10 202 BB101 CC01 3451 DD09 3 
 4 AA13 103 BB101 CC02 1231 4 
 5 AA14 203 BB101 CC02 4561 5 
打印（b）
 
 ID值
 0 AA12 101 BB101 CC01 1351 DD06 1 
 1 AA12 101 BB101 CC01 1351 DD06 2 
 2 AA11 102 BB101 CC01 DD07 2 
 3 AA10 202 BB101 CC01 3451 DD09 3 
 4 AA13 103 BB101 CC02 4 
 5 AA14 203 BB101 CC02 4561 6

输出：

  ID值匹配？价值匹配？ 
 0 AA12 101 BB101 CC01 DD06 1是是
 1 AA12 101 BB101 CC01 DD06 2是是
 2 AA11 102 BB101 CC01 2341 DD07 2是是
 3 AA10 202 BB101 CC01 3451 DD09 3是是
 4 AA13 103 BB101 CC02 1231 4否是
 5 AA14 203 BB101 CC02 4561 5是否

以下是@MaxU从上一篇文章中写的代码：

  pd.merge .assign（x = a.ID.str.split（）。apply（sorted）.str.join（''）），
 b.assign（x = b.ID.str.split（）。apply （排序）.str.join（''）），
 on = ['x'，'Value']，
 how ='outer'，
 indicator = True）

我想要实现：

如果任何数据帧不包含['ID']
（即2341,3451）的四位数项目，我想要e

如果相同的ID出现不止一次，则可以在
['Value']上具有不同的值。

此代码的结果是 here 。不幸的是，它没有达到预期的结果。只有索引3匹配。我正在调整代码，但无法弄清楚下一步。

非常感谢您的时间和考虑！

解决方案

尝试这样：

首先让我们在两个DF中拆分和堆栈ID列：

在[248]中：d1 = df1.set_index（'Value'）。ID.str.split（expand = True）.stack（）。to_frame （'ID'）。reset_index（）。rename（columns = {'level_1'：'idx'}） ...：d2 = df2.set_index（'Value'）。ID.str.split = true）.stack（）。to_frame（'ID'）。reset_index（）。rename（columns = {'level_1'：'idx'}）在[249]中：d1 出[249]：值idx ID 0 1 0 AA12 1 1 1 101 2 1 2 BB101 3 1 3 CC01 4 1 4 DD06 5 2 0 AA12 6 2 1 101 7 2 2 BB101 8 2 3 CC01 9 2 4 DD06 10 2 0 AA11 11 2 1 102 12 2 2 BB101 13 2 3 CC01 14 2 4 2341 15 2 5 DD07 16 3 0 AA 10 17 3 1 202 18 3 2 BB101 19 3 3 CC01 20 3 4 3451 21 3 5 DD09 22 4 0 AA13 23 4 1 103 24 4 2 BB101 25 4 3 CC02 26 4 4 1231 27 5 0 AA14 28 5 1 203 29 5 2 BB101 30 5 3 CC02 31 5 4 4561 在[250]中：d2 输出[250]：值idx ID 0 1 0 AA12 1 1 1 101 2 1 2 BB101 3 1 3 CC01 4 1 4 1351 5 1 5 DD06 6 2 0 AA12 7 2 1 101 8 2 2 BB101 9 2 3 CC01 10 2 4 1351 11 2 5 DD06 12 2 0 AA11 13 2 1 102 14 2 2 BB1 01 15 2 3 CC01 16 2 4 DD07 17 3 0 AA10 18 3 1 202 19 3 2 BB101 20 3 3 CC01 21 3 4 3451 22 3 5 DD09 23 4 0 AA13 24 4 1 103 25 4 2 BB101 26 4 3 CC02 27 6 0 AA14 28 6 1 203 29 6 2 BB101 30 6 3 CC02 31 6 4 4561 pre>

现在我们可以搜索'不匹配' ID：

 在[251]中：no_match_idx = d1.loc [〜d1.ID.isin（d2.ID），'idx']。unique（）
 
在[252]中：no_match_idx 
 Out [252]：array（[4]，dtype = int64）
 
在[253]中：df1 ['Matched_ID'] =〜df1 .index.isin（no_match_idx）
 ...：df1 ['Matched_Value'] = df1.Value.isin（df2.Value）

结果：

 在[254]中：df1 
输出[254 ]：
 ID值Matched_ID Matched_Value 
 0 AA12 101 BB101 CC01 DD06 1 True True 
 1 AA12 101 BB101 CC01 DD06 2 True True 
 2 AA11 102 BB101 CC01 2341 DD07 2 True True 
 3 AA10 202 BB101 CC01 3451 DD09 3 True True 
 4 AA13 103 BB101 CC02 1231 4 False True 
 5 AA14 203 BB101 CC02 4561 5 True False

Previous question: Pandas Compare two dataframes and determine the matched values

I have two dataframes:

print(a)

    ID                              Value
0   AA12 101 BB101 CC01 DD06        1
1   AA12 101 BB101 CC01 DD06        2
2   AA11 102 BB101 CC01 2341 DD07   2
3   AA10 202 BB101 CC01 3451 DD09   3
4   AA13 103 BB101 CC02 1231        4
5   AA14 203 BB101 CC02 4561        5
print(b)

    ID                              Value
0   AA12 101 BB101 CC01 1351 DD06   1
1   AA12 101 BB101 CC01 1351 DD06   2
2   AA11 102 BB101 CC01 DD07        2
3   AA10 202 BB101 CC01 3451 DD09   3
4   AA13 103 BB101 CC02             4
5   AA14 203 BB101 CC02 4561        6

Desired output:

     ID                              Value   ID Matched?   Value Matched?
0    AA12 101 BB101 CC01 DD06        1       Yes           Yes 
1    AA12 101 BB101 CC01 DD06        2       Yes           Yes 
2    AA11 102 BB101 CC01 2341 DD07   2       Yes           Yes
3    AA10 202 BB101 CC01 3451 DD09   3       Yes           Yes
4    AA13 103 BB101 CC02 1231        4       No            Yes
5    AA14 203 BB101 CC02 4561        5       Yes           No

Here's the code written by @MaxU from the previous post:

pd.merge(a.assign(x=a.ID.str.split().apply(sorted).str.join(' ')),
             b.assign(x=b.ID.str.split().apply(sorted).str.join(' ')),
             on=['x','Value'],
             how='outer',
             indicator=True)

What I want to achieve:

If either dataframes do not contain four-digit item from ['ID'] (i.e. 2341, 3451), I want to exclude it from the matching process.
If Same ID appears more than once, they can have different values on ['Value'].

The result of this code is here. Unfortunately, it doesn't achieve the desired result. Only Index 3 gets matched. I was tweaking the code but couldn't figure out the next step.

Thanks you so much for your time and consideration!

解决方案

Try this:

first let's split and stack ID column in both DFs:

In [248]: d1 = df1.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})
     ...: d2 = df2.set_index('Value').ID.str.split(expand=True).stack().to_frame('ID').reset_index().rename(columns={'level_1':'idx'})

In [249]: d1
Out[249]:
    Value  idx     ID
0       1    0   AA12
1       1    1    101
2       1    2  BB101
3       1    3   CC01
4       1    4   DD06
5       2    0   AA12
6       2    1    101
7       2    2  BB101
8       2    3   CC01
9       2    4   DD06
10      2    0   AA11
11      2    1    102
12      2    2  BB101
13      2    3   CC01
14      2    4   2341
15      2    5   DD07
16      3    0   AA10
17      3    1    202
18      3    2  BB101
19      3    3   CC01
20      3    4   3451
21      3    5   DD09
22      4    0   AA13
23      4    1    103
24      4    2  BB101
25      4    3   CC02
26      4    4   1231
27      5    0   AA14
28      5    1    203
29      5    2  BB101
30      5    3   CC02
31      5    4   4561

In [250]: d2
Out[250]:
    Value  idx     ID
0       1    0   AA12
1       1    1    101
2       1    2  BB101
3       1    3   CC01
4       1    4   1351
5       1    5   DD06
6       2    0   AA12
7       2    1    101
8       2    2  BB101
9       2    3   CC01
10      2    4   1351
11      2    5   DD06
12      2    0   AA11
13      2    1    102
14      2    2  BB101
15      2    3   CC01
16      2    4   DD07
17      3    0   AA10
18      3    1    202
19      3    2  BB101
20      3    3   CC01
21      3    4   3451
22      3    5   DD09
23      4    0   AA13
24      4    1    103
25      4    2  BB101
26      4    3   CC02
27      6    0   AA14
28      6    1    203
29      6    2  BB101
30      6    3   CC02
31      6    4   4561

now we can search for 'not matched' IDs:

In [251]: no_match_idx = d1.loc[~d1.ID.isin(d2.ID), 'idx'].unique()

In [252]: no_match_idx
Out[252]: array([4], dtype=int64)

In [253]: df1['Matched_ID'] = ~df1.index.isin(no_match_idx)
     ...: df1['Matched_Value'] = df1.Value.isin(df2.Value)

Result:

In [254]: df1
Out[254]:
                              ID  Value Matched_ID Matched_Value
0       AA12 101 BB101 CC01 DD06      1       True          True
1       AA12 101 BB101 CC01 DD06      2       True          True
2  AA11 102 BB101 CC01 2341 DD07      2       True          True
3  AA10 202 BB101 CC01 3451 DD09      3       True          True
4       AA13 103 BB101 CC02 1231      4      False          True
5       AA14 203 BB101 CC02 4561      5       True         False

这篇关于比较两个数据帧并过滤匹配的值的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

比较两个数据帧并过滤匹配的值 [英] Compare two dataframes and filter the matched values

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录关闭

比较两个数据帧并过滤匹配的值 [英] Compare two dataframes and filter the matched values

问题描述

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭