How to replace the column of a dataframe based on priority order?


Problem description

I have a dataframe column df['Annotations'] as follows:

missense_variant&splice_region_variant
stop_gained&splice_region_variant
splice_acceptor_variant&coding_sequence_variant&intron_variant
splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&5_prime_UTR_variant&intron_variant
missense_variant&NMD_transcript_variant
frameshift_variant&splice_region_variant
splice_acceptor_variant&intron_variant
splice_acceptor_variant&coding_sequence_variant
stop_lost&3_prime_UTR_variant
missense_variant
splice_region_variant

I want to replace this column, or add a new one, according to a priority order. The priority is given as:

Type                 Rank
frameshift_variant      1
stop_gained             2
splice_region_variant   3
splice_acceptor_variant 4
splice_donor_variant    5
missense_variant        6
coding_sequence_variant 7

I want to replace df['Annotations'], or add a new column df['Anno_prio'], so that each row keeps only its highest-priority term:

splice_region_variant
stop_gained
splice_acceptor_variant
splice_acceptor_variant
missense_variant
frameshift_variant
splice_acceptor_variant
splice_acceptor_variant
stop_lost
missense_variant
splice_region_variant

The way I tried was to replace each combination one by one:

df['Annotations']=df['Annotations'].str.replace('missense_variant&splice_region_variant','splice_region_variant')

Is there another way to do this with pandas?

Recommended answer

The idea is to split each value on '&' and build a dictionary in a dict comprehension that maps every term to its Rank with dict.get. Terms missing from the priority table default to the next value after the maximal Rank. The function then returns the key with the minimal value:

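# df1 is the priority table from the question, with columns Type and Rank
d = df1.set_index('Type')['Rank'].to_dict()
# terms missing from the table get a rank just past the largest one
max1 = df1['Rank'].max() + 1

def f(x):
    # map each '&'-separated term to its rank
    d1 = {y: d.get(y, max1) for y in x.split('&')}
    # https://stackoverflow.com/a/280156/2901002
    return min(d1, key=d1.get)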

df['Anno_prio'] = df['Annotations'].apply(f)
print (df)
                                          Annotations                Anno_prio
0              missense_variant&splice_region_variant    splice_region_variant
1                   stop_gained&splice_region_variant              stop_gained
2   splice_acceptor_variant&coding_sequence_varian...  splice_acceptor_variant
3   splice_donor_variant&splice_acceptor_variant&c...  splice_acceptor_variant
4             missense_variant&NMD_transcript_variant         missense_variant
5            frameshift_variant&splice_region_variant       frameshift_variant
6              splice_acceptor_variant&intron_variant  splice_acceptor_variant
7     splice_acceptor_variant&coding_sequence_variant  splice_acceptor_variant
8                       stop_lost&3_prime_UTR_variant                stop_lost
9                                    missense_variant         missense_variant
10                              splice_region_variant    splice_region_variant
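For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same dictionary idea. The frames are rebuilt from the data in the question. The names rank, fallback and top_priority are illustrative and not part of the original answer. It calls min directly on the list of terms, so ties (for example two unranked terms) go to the term that appears first.

```python
import pandas as pd

# Annotations column, copied from the question
df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_gained&splice_region_variant",
    "splice_acceptor_variant&coding_sequence_variant&intron_variant",
    "splice_donor_variant&splice_acceptor_variant&coding_sequence_variant"
    "&5_prime_UTR_variant&intron_variant",
    "missense_variant&NMD_transcript_variant",
    "frameshift_variant&splice_region_variant",
    "splice_acceptor_variant&intron_variant",
    "splice_acceptor_variant&coding_sequence_variant",
    "stop_lost&3_prime_UTR_variant",
    "missense_variant",
    "splice_region_variant",
]})

# Priority table, copied from the question (lower Rank wins)
df1 = pd.DataFrame({
    "Type": ["frameshift_variant", "stop_gained", "splice_region_variant",
             "splice_acceptor_variant", "splice_donor_variant",
             "missense_variant", "coding_sequence_variant"],
    "Rank": [1, 2, 3, 4, 5, 6, 7],
})

rank = dict(zip(df1["Type"], df1["Rank"]))
fallback = df1["Rank"].max() + 1  # rank used for terms not in the table


def top_priority(annotation: str) -> str:
    """Return the '&'-separated term with the lowest rank (first one on ties)."""
    terms = annotation.split("&")
    return min(terms, key=lambda t: rank.get(t, fallback))


df["Anno_prio"] = df["Annotations"].apply(top_priority)
print(df["Anno_prio"].tolist())
```

Assigning the result back to df['Annotations'] instead of df['Anno_prio'] replaces the column in place.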

A pandas-only solution uses DataFrame.explode with DataFrame.sort_values, then removes the duplicated index values (keeping the best-ranked term for each row) and sorts the index:

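d = df1.set_index('Type')['Rank'].to_dict()

# one row per '&'-separated term; the original index is repeated
df = (df.assign(Anno_prio = df['Annotations'].str.split('&'))
        .explode('Anno_prio')
        # rank of each term, NaN when the term is not in the table
        .assign(new = lambda x: x['Anno_prio'].map(d))
        .sort_values('new')
        )
# keep the first (lowest-rank) row per original index, restore row order
df = df[~df.index.duplicated()].sort_index()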


print (df)
                                          Annotations  \
0              missense_variant&splice_region_variant   
1                   stop_gained&splice_region_variant   
2   splice_acceptor_variant&coding_sequence_varian...   
3   splice_donor_variant&splice_acceptor_variant&c...   
4             missense_variant&NMD_transcript_variant   
5            frameshift_variant&splice_region_variant   
6              splice_acceptor_variant&intron_variant   
7     splice_acceptor_variant&coding_sequence_variant   
8                       stop_lost&3_prime_UTR_variant   
9                                    missense_variant   
10                              splice_region_variant   

                  Anno_prio  new  
0     splice_region_variant  3.0  
1               stop_gained  2.0  
2   splice_acceptor_variant  4.0  
3   splice_acceptor_variant  4.0  
4          missense_variant  6.0  
5        frameshift_variant  1.0  
6   splice_acceptor_variant  4.0  
7   splice_acceptor_variant  4.0  
8                 stop_lost  NaN  
9          missense_variant  6.0  
10    splice_region_variant  3.0  
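The explode-based variant can be sketched as a runnable example too. This one uses only four rows from the question to keep it short. Two details are my own assumptions rather than part of the original answer: the sort is requested as kind="mergesort" so that tied or unranked terms keep their original left-to-right order, and the helper column new is dropped at the end. The names rank, exploded and result are illustrative.

```python
import pandas as pd

# A few rows from the question, enough to show ranked and unranked terms
df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_lost&3_prime_UTR_variant",
    "frameshift_variant&splice_region_variant",
    "missense_variant",
]})

# Priority table from the question, as a Series indexed by Type
rank = pd.Series(
    [1, 2, 3, 4, 5, 6, 7],
    index=["frameshift_variant", "stop_gained", "splice_region_variant",
           "splice_acceptor_variant", "splice_donor_variant",
           "missense_variant", "coding_sequence_variant"],
)

# One row per term; the original row label is repeated in the index
exploded = (
    df.assign(Anno_prio=df["Annotations"].str.split("&"))
      .explode("Anno_prio")
)
exploded["new"] = exploded["Anno_prio"].map(rank)  # NaN for unranked terms

result = (
    exploded
    # stable sort: lowest rank first, NaN last, ties keep original order
    .sort_values("new", kind="mergesort", na_position="last")
    # keep only the first (best) row for each original row label
    .loc[lambda x: ~x.index.duplicated(keep="first")]
    .sort_index()
    .drop(columns="new")
)
print(result)
```

Compared with the apply-based function, this version stays inside pandas operations, at the cost of temporarily creating one row per term.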
