Python Pandas:替换多个列中匹配另一个数据框中多个列的值 [英] Python pandas: replace values multiple columns matching multiple columns from another dataframe

查看:704
本文介绍了Python Pandas:替换多个列中匹配另一个数据框中多个列的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我搜索了很多答案,最接近的问题是

df2

ID           CHR   STOP  OCHR  OSTOP
rs376643643    1  10040     1  10020
rs373328635    1  10066     1  10056    
rs62651026     1  10208     1  10108    
rs376007522    1  10209     1  10109   
rs368469931    3  30247     1  10139

仅在df1 [['chr','pos'时,我才需要使用df2 [['ID','OCHR','OSTOP']]更新df1中的['snp','chr','pos'] ']]匹配df2 [[''OCHR','OSTOP']]

因此,在这种情况下,更新后,df1将如下所示:

chr       snp  x     pos a1 a2    
1  rs376643643  0  10040  G  A    
1  rs373328635  0  10066  C  G    
1  rs62651026   0  10208  C  G    
1  rs376007522  0  10209  C  G    
3  rs368469931  0  30247  C  T

我已使用merge作为一种解决方法:

df1 = pd.merge(df1, df2, how='left', left_on=["chr", "pos"], right_on=["OCHR", "OSTOP"],
                                     left_index=False, right_index=False, sort=False)

然后

df1.loc[~df1.OCHR.isnull(), ["snp", "chr", "pos"]] = df1.loc[~df1.OCHR.isnull(), ["ID", "CHR", "STOP"]].values

,然后删除多余的列.

是的,它可以工作,但是通过比较两个数据帧中的值来直接做到这一点的方法是什么,我只是不知道该如何制定它,而且我在任何地方都找不到答案.我想可能对此有一个普遍的答案.

我尝试过,但是没用:

df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["snp", "chr", "pos"]] = df2.loc[df2[['OCHR', 'OSTOP']] == df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["chr", "pos"]],['ID', ''CHR', 'STOP']].values

谢谢

Stephane

解决方案

您可以使用update函数(需要将匹配条件设置为索引).我已经修改了您的示例数据,以允许出现一些不匹配的情况.

# your data
# =====================
# df1 pos is modified from 10020 to 10010
print(df1)

   chr      snp  x    pos a1 a2
0    1  1-10020  0  10010  G  A
1    1  1-10056  0  10056  C  G
2    1  1-10108  0  10108  C  G
3    1  1-10109  0  10109  C  G
4    1  1-10139  0  10139  C  T

print(df2)

            ID  CHR   STOP  OCHR  OSTOP
0  rs376643643    1  10040     1  10020
1  rs373328635    1  10066     1  10056
2   rs62651026    1  10208     1  10108
3  rs376007522    1  10209     1  10109
4  rs368469931    3  30247     1  10139

# processing
# ==========================
# set matching columns to multi-level index
x1 = df1.set_index(['chr', 'pos'])['snp']
x2 = df2.set_index(['OCHR', 'OSTOP'])['ID']
# call update function, this is inplace
x1.update(x2)
# replace the values in original df1
df1['snp'] = x1.values
print(df1)

   chr          snp  x    pos a1 a2
0    1      1-10020  0  10010  G  A
1    1  rs373328635  0  10056  C  G
2    1   rs62651026  0  10108  C  G
3    1  rs376007522  0  10109  C  G
4    1  rs368469931  0  10139  C  T

I searched a lot for an answer, the closest question was Compare 2 columns of 2 different pandas dataframes, if the same insert 1 into the other in Python, but the answer to this person's particular problem was a simple merge, which doesn't answer the question in a general way.

I have two large dataframes, df1 (typically about 10 million rows), and df2 (about 130 million rows). I need to update values in three columns of df1 with values from three columns of df2, based on two df1 columns matching two df2 columns. It is imperative that the order of df1 remains unchanged, and that only rows with matching values get updated.

This is how the dataframes look like:

df1

chr    snp  x    pos a1 a2
1  1-10020  0  10020  G  A    
1  1-10056  0  10056  C  G    
1  1-10108  0  10108  C  G
1  1-10109  0  10109  C  G    
1  1-10139  0  10139  C  T

Note that it's not always the case that the values of "snp" is chr-pos, it can take many other values with no link to any of the columns (like rs1234, indel-6032 etc)

df2

ID           CHR   STOP  OCHR  OSTOP
rs376643643    1  10040     1  10020
rs373328635    1  10066     1  10056    
rs62651026     1  10208     1  10108    
rs376007522    1  10209     1  10109   
rs368469931    3  30247     1  10139

I need to update ['snp', 'chr', 'pos'] in df1 with df2[['ID', 'OCHR', 'OSTOP']] only when df1[['chr', 'pos']] matches df2[['OCHR', 'OSTOP']]

so in this case, after update, df1 would look like:

chr       snp  x     pos a1 a2    
1  rs376643643  0  10040  G  A    
1  rs373328635  0  10066  C  G    
1  rs62651026   0  10208  C  G    
1  rs376007522  0  10209  C  G    
3  rs368469931  0  30247  C  T

I have used merge as a workaround:

df1 = pd.merge(df1, df2, how='left', left_on=["chr", "pos"], right_on=["OCHR", "OSTOP"],
                                     left_index=False, right_index=False, sort=False)

and then

df1.loc[~df1.OCHR.isnull(), ["snp", "chr", "pos"]] = df1.loc[~df1.OCHR.isnull(), ["ID", "CHR", "STOP"]].values

and then remove the extra columns.

Yes, it works, but what would be a way to do that directly by comparing the values from both dataframes, I just don't know how to formulate it, and I couldn't find an answer anywhere; I guess it could be useful to get a general answer on this.

I tried that but it doesn't work:

df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["snp", "chr", "pos"]] = df2.loc[df2[['OCHR', 'OSTOP']] == df1.loc[(df1.chr==df2.OCHR) & (df1.pos==df2.OSTOP),["chr", "pos"]],['ID', ''CHR', 'STOP']].values

Thanks,

Stephane

解决方案

You can use the update function (requires setting the matching criteria to index). I've modified your sample data to allow some mismatch.

# your data
# =====================
# df1 pos is modified from 10020 to 10010
print(df1)

   chr      snp  x    pos a1 a2
0    1  1-10020  0  10010  G  A
1    1  1-10056  0  10056  C  G
2    1  1-10108  0  10108  C  G
3    1  1-10109  0  10109  C  G
4    1  1-10139  0  10139  C  T

print(df2)

            ID  CHR   STOP  OCHR  OSTOP
0  rs376643643    1  10040     1  10020
1  rs373328635    1  10066     1  10056
2   rs62651026    1  10208     1  10108
3  rs376007522    1  10209     1  10109
4  rs368469931    3  30247     1  10139

# processing
# ==========================
# set matching columns to multi-level index
x1 = df1.set_index(['chr', 'pos'])['snp']
x2 = df2.set_index(['OCHR', 'OSTOP'])['ID']
# call update function, this is inplace
x1.update(x2)
# replace the values in original df1
df1['snp'] = x1.values
print(df1)

   chr          snp  x    pos a1 a2
0    1      1-10020  0  10010  G  A
1    1  rs373328635  0  10056  C  G
2    1   rs62651026  0  10108  C  G
3    1  rs376007522  0  10109  C  G
4    1  rs368469931  0  10139  C  T

这篇关于Python Pandas:替换多个列中匹配另一个数据框中多个列的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆