Matching keys from two different dataframes


Question

I have two dataframes,

df1,
    Name    Stage   Description                                 key
0   Sri      1      Sri is one of the good singer in this two   one
1   NaN      2      Thanks for reading                          two has
2   Ram      1      Ram is two of the good cricket player       three
3   ganesh   1      one driver                                  four
4   NaN      2      good buddies                                NaN


df2,
    values
    member of four
    one of three friends
    sri is a cricketer
    Rahul has two brothers

I want to replace each df1["key"] with the matching df2 value, if the key is present in df2["values"].

I tried:

df1["key"] = df2[df2["values"].str.contains("|".join(df2["values"].tolist()), na=False)]

But I am getting the output in the same order.

I want

output_df,
    Name    Stage   Description                                 key
0   Sri      1      Sri is one of the good singer in this two   one of three friends
1   NaN      2      Thanks for reading                          Rahul has two brothers
2   Ram      1      Ram is two of the good cricket player       one of three friends
3   ganesh   1      one driver                                  member of four
4   NaN      2      good buddies                                NaN

Recommended answer

I'll use arrays of sets, with <= for subset testing, and numpy broadcasting.
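
To see the subset test in isolation, here is a minimal sketch on toy data. The helper `as_object_array` and the names `keys`, `phrases`, and `grid` are illustrative assumptions, not part of the frames in the question. For Python sets, `x <= y` is True when every element of `x` is also in `y`; on object arrays numpy applies that comparison element by element, and `keys[:, None]` makes it broadcast into a keys-by-phrases grid.

```python
import numpy as np

def as_object_array(sets):
    """Store each set as one element of a 1-D object array (hypothetical helper)."""
    out = np.empty(len(sets), dtype=object)
    for pos, s in enumerate(sets):
        out[pos] = s
    return out

# Toy data, loosely modelled on the question
keys = as_object_array([{"one"}, {"two", "has"}, {"five"}])
phrases = as_object_array([
    {"member", "of", "four"},
    {"Rahul", "has", "two", "brothers"},
    {"one", "of", "three", "friends"},
])

# keys[:, None] has shape (3, 1) and phrases has shape (3,), so the
# comparison broadcasts to a (3, 3) grid: grid[r, c] is True when
# keys[r] is a subset of phrases[c].
grid = np.asarray(keys[:, None] <= phrases, dtype=bool)

has_match = grid.any(axis=1)     # does the key match any phrase?
first_hit = grid.argmax(axis=1)  # column of the first True in each row
```

Note that `argmax` returns 0 for a row with no True at all (the `{"five"}` row here), so the position alone cannot tell a match in column 0 from no match. That is why a separate "any match" test is needed.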

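import numpy as np
import pandas as pd

# Split a string into its set of whitespace-separated words
setify = lambda x: set(x.split())
v = df2['values'].values.astype(str)
k = df1['key'].values.astype(str)  # note: NaN becomes the string 'nan'
i = df1.index

# These are the sets
a = np.array([setify(x) for x in k.tolist()])
b = np.array([setify(x) for x in v.tolist()])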

# This is the broadcasting
matches = (a[:, None] <= b)

# Additional test that each key matches at least one value
any_ = matches.any(1)
# Test that the key wasn't null in the first place
nul_ = df1['key'].notnull().values
mask = any_ & nul_

# argmax finds where the first set match is in each row; there may be
# more than one match. argmax also returns 0 for rows with no match,
# so I use `mask` to pass a slice of the series that targets only the
# correct rows, and `assign` leaves the remaining rows as NaN.
df1.assign(key1=pd.Series(v[matches.argmax(1)], i)[mask])

     Name  Stage                                Description      key                    key1
0     Sri      1  Sri is one of the good singer in this two      one    one of three friends
1     NaN      2                         Thanks for reading  two has  Rahul has two brothers
2     Ram      1      Ram is two of the good cricket player    three    one of three friends
3  ganesh      1                                 one driver     four          member of four
4     NaN      2                               good buddies      NaN                     NaN
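
The result above adds a new column `key1` rather than overwriting `key`. If overwriting is what is wanted, the following is a minimal sketch in plain Python, without broadcasting. It rebuilds the sample frames so that it runs on its own. The function name `first_match` is a hypothetical helper, and keeping the original key when nothing matches is an assumption about the desired behaviour rather than something the answer states.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Sri", np.nan, "Ram", "ganesh", np.nan],
    "Stage": [1, 2, 1, 1, 2],
    "Description": [
        "Sri is one of the good singer in this two",
        "Thanks for reading",
        "Ram is two of the good cricket player",
        "one driver",
        "good buddies",
    ],
    "key": ["one", "two has", "three", "four", np.nan],
})
df2 = pd.DataFrame({"values": [
    "member of four",
    "one of three friends",
    "sri is a cricketer",
    "Rahul has two brothers",
]})

def first_match(key, phrases):
    """Return the first phrase containing every word of `key`.

    Missing keys and keys with no match are returned unchanged
    (an assumption about the desired behaviour).
    """
    if pd.isna(key):
        return key
    words = set(key.split())
    for phrase in phrases:
        if words <= set(phrase.split()):  # whole-word, case-sensitive subset test
            return phrase
    return key

phrases = df2["values"].tolist()
# assign returns a new frame; df1 itself is left untouched
result = df1.assign(key=df1["key"].map(lambda k: first_match(k, phrases)))
```

Like the broadcasting version, this compares whole words and is case-sensitive, so a key such as "Sri" would not match "sri is a cricketer". It loops in Python, which is easier to read but will be slower than the numpy approach on large frames.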
