Conditionally loop through chromosome and position in one dataframe to chromosome and intervals in other dataframe

Problem description

import pandas as pd

df1 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250]})
df2 = pd.DataFrame({'Chr': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2',
                            '3', '3', '3', '3', '3'],
                    'start': [0, 100, 1000, 2000, 3000, 0, 100, 1000, 2000, 3000,
                              0, 100, 1000, 2000, 3000],
                    'end': [100, 1000, 2000, 3000, 4000, 100, 1000, 2000, 3000, 4000,
                            100, 1000, 2000, 3000, 4000],
                    'logr': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.5, 0.5, 0.2, 0.2, 0.1, 0.2,
                            0.1, 0.5, 0.5, 0.9, 0.3]})

I want to match each 'Chr' and 'position' in df1 against the intervals in df2 for the same 'Chr' (that is, where the position in df1 falls between 'start' and 'end'), then add the matching 'logr' and 'seg' columns to df1.

My desired output is:

df3 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250],
                    'logr': [3, 4, 10, 11, 18, 13, "NA"],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.3, 0.1, "NA"]})

Thanks in advance.

Recommended answer

Use DataFrame.merge with an outer join on 'Chr' to build every position/interval combination within each chromosome, then filter with Series.between and boolean indexing, using DataFrame.pop to extract and drop the 'start' and 'end' columns. A final left join back to df1 restores the rows that matched no interval:

df3 = df1.merge(df2, on='Chr', how='outer')
# between() includes both endpoints by default (>=, <=);
# pass inclusive='neither' for (>, <) (inclusive=False in pandas older than 1.3)
df3 = df3[df3['position'].between(df3.pop('start'), df3.pop('end'))]
# if one bound should be inclusive and the other not (e.g. >, <=)
#df3 = df3[(df3['position'] > df3.pop('start')) & (df3['position'] <= df3.pop('end'))]
# left join back to df1 so positions without a matching interval get NaN
df3 = df1.merge(df3, how='left')
print(df3)
  Chr  position  logr  seg
0   1        50   3.0  0.2
1   1       500   4.0  0.5
2   2      1030  10.0  0.2
3   2      2005  11.0  0.1
4   3      3575  18.0  0.3
5   3        50  13.0  0.1
6   4       250   NaN  NaN
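
The sample intervals share endpoints (0-100, 100-1000, ...), so with the default inclusive bounds a position that sits exactly on a boundary matches two intervals and appears twice in the result. The sketch below illustrates this on a small made-up dataset and shows one way to avoid it with half-open intervals. The function name annotate_positions and the test frames are hypothetical, and the string values for inclusive assume pandas 1.3 or newer.

```python
import pandas as pd

def annotate_positions(positions, segments, inclusive="both"):
    """Attach logr/seg from every segment on the same Chr that contains the position."""
    # Pair each position with all segments on the same chromosome.
    pairs = positions.merge(segments, on="Chr", how="left")
    # Rows with missing bounds (no segment on that Chr) evaluate to False here.
    hit = pairs["position"].between(pairs["start"], pairs["end"], inclusive=inclusive)
    matched = pairs.loc[hit].drop(columns=["start", "end"])
    # Left join back so unmatched positions are kept with NaN values.
    return positions.merge(matched, on=["Chr", "position"], how="left")

# Hypothetical data: position 100 lies on the boundary of two segments.
positions = pd.DataFrame({"Chr": ["1", "1", "4"], "position": [100, 500, 250]})
segments = pd.DataFrame({"Chr": ["1", "1"], "start": [0, 100], "end": [100, 1000],
                         "logr": [3, 4], "seg": [0.2, 0.5]})

closed = annotate_positions(positions, segments)                       # start <= pos <= end
half_open = annotate_positions(positions, segments, inclusive="left")  # start <= pos < end
print(closed)
print(half_open)
```

With closed bounds the boundary position is reported once per touching interval; with inclusive="left" each position matches at most one of the contiguous intervals. Which convention is right depends on how the interval coordinates were defined upstream.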

Another solution:

df3 = df1.merge(df2, on='Chr', how='outer')
s = df3.pop('start')
e = df3.pop('end')
df3 = df3[df3['position'].between(s, e) | s.isna() | e.isna()]
# if one bound should be inclusive and the other not (e.g. >, <=)
#df3 = df3[((df3['position'] > s) & (df3['position'] <= e)) | s.isna() | e.isna()]
print(df3)
   Chr  position  logr  seg
0    1        50   3.0  0.2
6    1       500   4.0  0.5
12   2      1030  10.0  0.2
18   2      2005  11.0  0.1
24   3      3575  18.0  0.3
25   3        50  13.0  0.1
30   4       250   NaN  NaN
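
Both solutions above first build every position/interval pair per chromosome, which can get large for real genomic data. As a different technique (not part of the original answer), pandas.merge_asof can pick the candidate interval directly. This is a minimal sketch, assuming the intervals within each chromosome do not overlap; '_order' is a hypothetical helper column used only to restore the row order of df1.

```python
import pandas as pd

df1 = pd.DataFrame({'Chr': ['1', '1', '2', '2', '3', '3', '4'],
                    'position': [50, 500, 1030, 2005, 3575, 50, 250]})
df2 = pd.DataFrame({'Chr': ['1'] * 5 + ['2'] * 5 + ['3'] * 5,
                    'start': [0, 100, 1000, 2000, 3000] * 3,
                    'end': [100, 1000, 2000, 3000, 4000] * 3,
                    'logr': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18],
                    'seg': [0.2, 0.5, 0.2, 0.1, 0.5, 0.5, 0.2, 0.2, 0.1, 0.2,
                            0.1, 0.5, 0.5, 0.9, 0.3]})

# merge_asof requires both frames to be sorted by their "on" keys.
left = df1.assign(_order=range(len(df1))).sort_values('position')
right = df2.sort_values('start')

# For each position, take the last interval on the same Chr with start <= position.
out = pd.merge_asof(left, right, left_on='position', right_on='start', by='Chr')

# Blank the values where the position lies past the end of that interval.
past_end = out['position'] > out['end']
for col in ['logr', 'seg']:
    out[col] = out[col].mask(past_end)

# Restore the original row order of df1 and drop the helper columns.
out = (out.sort_values('_order')
          .drop(columns=['_order', 'start', 'end'])
          .reset_index(drop=True))
print(out)
```

Note that a position exactly on a shared boundary is assigned to the interval that starts there, so this variant returns one row per position rather than duplicating boundary hits.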
