How to replace the column of a dataframe based on priority order?
Problem description
I have a dataframe column df['Annotations'] like the following:
missense_variant&splice_region_variant
stop_gained&splice_region_variant
splice_acceptor_variant&coding_sequence_variant&intron_variant
splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&5_prime_UTR_variant&intron_variant
missense_variant&NMD_transcript_variant
frameshift_variant&splice_region_variant
splice_acceptor_variant&intron_variant
splice_acceptor_variant&coding_sequence_variant
stop_lost&3_prime_UTR_variant
missense_variant
splice_region_variant
I want to replace this column, or add a new one, so that each row keeps only its highest-priority term. The priority is given as:
Type Rank
frameshift_variant 1
stop_gained 2
splice_region_variant 3
splice_acceptor_variant 4
splice_donor_variant 5
missense_variant 6
coding_sequence_variant 7
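For readers who want to reproduce the problem, the two tables above could be built roughly as follows. This is only a sketch: the name df1 for the priority table is borrowed from the answer's code, and how the data is actually loaded is an assumption.

```python
import pandas as pd

# Annotation strings copied from the question.
df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_gained&splice_region_variant",
    "splice_acceptor_variant&coding_sequence_variant&intron_variant",
    "splice_donor_variant&splice_acceptor_variant&coding_sequence_variant"
    "&5_prime_UTR_variant&intron_variant",
    "missense_variant&NMD_transcript_variant",
    "frameshift_variant&splice_region_variant",
    "splice_acceptor_variant&intron_variant",
    "splice_acceptor_variant&coding_sequence_variant",
    "stop_lost&3_prime_UTR_variant",
    "missense_variant",
    "splice_region_variant",
]})

# Priority table: a lower Rank means a higher priority.
df1 = pd.DataFrame({
    "Type": [
        "frameshift_variant", "stop_gained", "splice_region_variant",
        "splice_acceptor_variant", "splice_donor_variant",
        "missense_variant", "coding_sequence_variant",
    ],
    "Rank": [1, 2, 3, 4, 5, 6, 7],
})

print(df.shape, df1.shape)
```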
I want to replace df['Annotations'] or add a new column df['Anno_prio'] as:
splice_region_variant
stop_gained
splice_acceptor_variant
splice_acceptor_variant
missense_variant
frameshift_variant
splice_acceptor_variant
splice_acceptor_variant
stop_lost
missense_variant
splice_region_variant
The way I tried was a separate replacement for each combination of terms:
df['Annotations'] = df['Annotations'].str.replace('missense_variant&splice_region_variant', 'splice_region_variant')
Is there another way to do this using pandas?
Recommended answer
The idea is to build a dictionary that maps each Type to its Rank, split every annotation on '&', and look up each term with dict.get inside a dictionary comprehension. Terms that are missing from the ranking get a default of the maximal Rank plus one, so they sort after every ranked term. The key with the minimal value in that per-row dictionary is the result:
# df1 is the priority table with the columns Type and Rank
d = df1.set_index('Type')['Rank'].to_dict()
max1 = df1['Rank'].max() + 1

def f(x):
    # rank every term of the '&'-separated string; unranked terms get max1
    d1 = {y: d.get(y, max1) for y in x.split('&')}
    # https://stackoverflow.com/a/280156/2901002
    return min(d1, key=d1.get)

df['Anno_prio'] = df['Annotations'].apply(f)
print(df)
                                          Annotations                Anno_prio
0              missense_variant&splice_region_variant    splice_region_variant
1                   stop_gained&splice_region_variant              stop_gained
2   splice_acceptor_variant&coding_sequence_varian...  splice_acceptor_variant
3   splice_donor_variant&splice_acceptor_variant&c...  splice_acceptor_variant
4             missense_variant&NMD_transcript_variant         missense_variant
5            frameshift_variant&splice_region_variant       frameshift_variant
6              splice_acceptor_variant&intron_variant  splice_acceptor_variant
7     splice_acceptor_variant&coding_sequence_variant  splice_acceptor_variant
8                       stop_lost&3_prime_UTR_variant                stop_lost
9                                    missense_variant         missense_variant
10                              splice_region_variant    splice_region_variant
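The intermediate dictionary in f is not strictly required. As a hedged variation on the same idea, min can be applied directly to the split list with a key function. The names rank, fallback and top_term below are hypothetical, and the ranking is typed in by hand instead of being read from df1. Because min returns the first minimal element, a row whose terms are all unranked keeps its leftmost term, which matches the stop_lost row above.

```python
import pandas as pd

df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_lost&3_prime_UTR_variant",
    "frameshift_variant&splice_region_variant",
]})

# Hypothetical hand-written copy of the priority table (lower is better).
rank = {
    "frameshift_variant": 1,
    "stop_gained": 2,
    "splice_region_variant": 3,
    "splice_acceptor_variant": 4,
    "splice_donor_variant": 5,
    "missense_variant": 6,
    "coding_sequence_variant": 7,
}
# Unranked terms sort after every ranked term.
fallback = max(rank.values()) + 1

def top_term(annotation: str) -> str:
    """Return the best-ranked term of an '&'-separated annotation string."""
    return min(annotation.split("&"), key=lambda term: rank.get(term, fallback))

df["Anno_prio"] = df["Annotations"].apply(top_term)
print(df["Anno_prio"].tolist())
# ['splice_region_variant', 'stop_lost', 'frameshift_variant']
```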
A pandas-only solution uses DataFrame.explode together with DataFrame.sort_values; as the last step, the duplicated index values are removed and the index is sorted again. Terms that are missing from the ranking map to NaN, which sort_values places last:
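d = df1.set_index('Type')['Rank'].to_dict()
# split into lists, one row per term, map each term to its rank, best rank first
df = (df.assign(Anno_prio = df['Annotations'].str.split('&'))
        .explode('Anno_prio')
        .assign(new = lambda x: x['Anno_prio'].map(d))
        .sort_values('new')
     )
# keep the first (best-ranked) row per original index, then restore the row order
df = df[~df.index.duplicated()].sort_index()
print(df)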
                                          Annotations  \
0              missense_variant&splice_region_variant
1                   stop_gained&splice_region_variant
2   splice_acceptor_variant&coding_sequence_varian...
3   splice_donor_variant&splice_acceptor_variant&c...
4             missense_variant&NMD_transcript_variant
5            frameshift_variant&splice_region_variant
6              splice_acceptor_variant&intron_variant
7     splice_acceptor_variant&coding_sequence_variant
8                       stop_lost&3_prime_UTR_variant
9                                    missense_variant
10                              splice_region_variant

                  Anno_prio  new
0     splice_region_variant  3.0
1               stop_gained  2.0
2   splice_acceptor_variant  4.0
3   splice_acceptor_variant  4.0
4          missense_variant  6.0
5        frameshift_variant  1.0
6   splice_acceptor_variant  4.0
7   splice_acceptor_variant  4.0
8                 stop_lost  NaN
9          missense_variant  6.0
10    splice_region_variant  3.0
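A self-contained sketch of the explode approach on a three-row sample is given below. It departs from the answer's code in two small, assumed ways: it requests a stable sort (kind="mergesort") so that terms with equal rank are guaranteed to keep their original left-to-right order, and it drops the helper column new at the end. The names rank, exploded and result are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Annotations": [
    "missense_variant&splice_region_variant",
    "stop_lost&3_prime_UTR_variant",
    "frameshift_variant&splice_region_variant",
]})

# Hypothetical hand-written copy of the priority table (lower is better).
rank = {
    "frameshift_variant": 1,
    "stop_gained": 2,
    "splice_region_variant": 3,
    "splice_acceptor_variant": 4,
    "splice_donor_variant": 5,
    "missense_variant": 6,
    "coding_sequence_variant": 7,
}

exploded = (
    df.assign(Anno_prio=df["Annotations"].str.split("&"))
      .explode("Anno_prio")                            # one row per term, index repeated
      .assign(new=lambda x: x["Anno_prio"].map(rank))  # unranked terms become NaN
      .sort_values("new", kind="mergesort")            # stable sort, NaN ranks go last
)

# Keep the first (best-ranked) row per original index and restore the row order.
result = (
    exploded[~exploded.index.duplicated()]
    .sort_index()
    .drop(columns="new")
)
print(result["Anno_prio"].tolist())
# ['splice_region_variant', 'stop_lost', 'frameshift_variant']
```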