How to select lines of file based on multiple conditions of another file in R?

Problem description

I have 2 genetic datasets. I filter file1 based on a column in file2. However, I also need to account for a second column in file2 and I'm not sure how to do this.

The condition for file 1 row extraction is that only rows that have a chromosome position either more than 5000 larger or more than 5000 smaller than any chromosome positions for variants on the same chromosome in file 2 are selected.

For example, my data looks like this:

File 1:

Variant   Chromosome   Chromosome Position
Variant1      2             14000     
Variant2      1             9000              
Variant3      8             37000          
Variant4      1             21000     

File 2:

Variant  Chromosome  Chromosome Position  
Variant1     1                 10000                   
Variant2     1                 20000                   
Variant3     8                 30000      

Expected output (of variants with a greater than +/-5000 position distance in comparison to any line of file 2 on the same chromosome):

Variant   Chromosome Position     Chromosome
Variant1    14000                  2
Variant3    37000                  8

#Variant1 at 14000 is within 5000 of Variant1 at 10000 in file 2, but it is on a different chromosome, so it is not compared and is kept.
#Variant3 at 37000 is on the same chromosome as Variant3 at 30000 in file 2, but more than 5000 away, so it is kept.

I've tried coding this in Unix, but I only managed the larger than 5000 +/- filtering for each variant without taking the chromosome into account. I've been advised to try coding it in R, but I'm new to R and I'm not sure where to start. I assume I need an if statement for "if line of file1 has matching chromosome number as file2, then perform the larger than 5000 +/- filtering within that chromosome number only", with a for loop going over each row. Even just guidance on how to learn to do this would be appreciated.

Recommended answer

Using your sample data and methods, I came up with this data.table-solution

Brief explanations are included in the code.

library(data.table)
#sample data
dt1 <- fread("Variant   Chromosome   Chromosome_Position  
Variant1      2             14000     
Variant2      1             9000              
Variant3      8             37000          
Variant4      1             21000")
dt2 <- fread("Variant  Chromosome  Chromosome_Position  
Variant1     1                 10000                   
Variant2     1                 20000                   
Variant3     8                 30000")

#create lower&upper boundaries for dt2 chromosome position
dt2[, c("low", "high") := .(Chromosome_Position - 5000, Chromosome_Position + 5000)]
#dt2 now looks like this:
#-------------------------------------------------------------
#     Variant Chromosome Chromosome_Position   low  high
# 1: Variant1          1               10000  5000 15000
# 2: Variant2          1               20000 15000 25000
# 3: Variant3          8               30000 25000 35000

#find matches on chromosome, with position between low and high
#  this is done using a non-equi join using the lower and upper boundaries
#  created in dt2 in the previous line.
#  on = .(...) means: Chromosome in dt1 and dt2 have to be the same
#                     Chromosome_Position in dt1 has to be between
#                       low and high of dt2.
#                       You can (of course) use >= and <= if desired.
#  match := i.Variant creates a new column in dt1, with the value of
#                     Variant from dt2 (if a match is found).
#                     If no match is found, the column gets <NA>.
dt1[ dt2, match := i.Variant,
     on = .(Chromosome, Chromosome_Position > low, Chromosome_Position < high ) ]
#dt1 now looks like this
#see the match-column for found dt1-matches in dt2
#-------------------------------------------------------------
#     Variant Chromosome Chromosome_Position    match
# 1: Variant1          2               14000     <NA>
# 2: Variant2          1                9000 Variant1
# 3: Variant3          8               37000     <NA>
# 4: Variant4          1               21000 Variant2

#discard all found matches (i.e. keep only rows where is.na(match) is TRUE),
#  and drop the match column, since we no longer need it.
dt1[ is.na(match) ][, match := NULL ][]

#     Variant Chromosome Chromosome_Position
# 1: Variant1          2               14000
# 2: Variant3          8               37000
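
For comparison, the "if statement plus for loop" idea from the question can be sketched in base R without data.table. This is only a minimal sketch on the sample data. The names `file1`, `file2`, `window`, `keep` and `result` are illustrative assumptions, not part of the answer above.

```r
# Sample data from the question, as plain data frames
file1 <- data.frame(
  Variant = c("Variant1", "Variant2", "Variant3", "Variant4"),
  Chromosome = c(2, 1, 8, 1),
  Chromosome_Position = c(14000, 9000, 37000, 21000)
)
file2 <- data.frame(
  Variant = c("Variant1", "Variant2", "Variant3"),
  Chromosome = c(1, 1, 8),
  Chromosome_Position = c(10000, 20000, 30000)
)

window <- 5000  # distance threshold taken from the question

# keep[i] becomes TRUE when row i of file1 is more than `window` away
# from every file2 variant on the same chromosome
keep <- logical(nrow(file1))
for (i in seq_len(nrow(file1))) {
  # file2 rows on the same chromosome as the current file1 row
  same_chr <- file2$Chromosome == file1$Chromosome[i]
  # absolute distances to those file2 positions (empty if none match)
  dist <- abs(file2$Chromosome_Position[same_chr] - file1$Chromosome_Position[i])
  # all() on an empty vector is TRUE, so rows on a chromosome that is
  # absent from file2 are kept
  keep[i] <- all(dist > window)
}

result <- file1[keep, ]
print(result)
```

On the sample data this keeps Variant1 and Variant3, the same rows as the data.table result. One difference to be aware of: the loop drops a variant that is exactly 5000 away (it is not "more than 5000"), whereas the join above with strict `>` and `<` would keep it. Using `>=` and `<=` in the join makes the two behave the same way.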
