How to group rows in a range and consider a 3rd column?


Problem description

I have a genetic dataset where I want to group genetic variants/rows that are physically close together in the genome. I want to group genes that are within ranges from certain spots in the genome per chromosome (chrom).

My 'spots' dataset holds the ranges that variants/rows need to fall within, and looks like:

 chrom      low       high
   1        500       1700
   1        19500     20600
   5        400       1500

My low and high columns define the ranges that I want to test the rows of my next dataset against, with the additional requirement that the chromosome (chrom) must also match. Each row with a unique range and chrom combination is its own group, and I want to see whether anything in my other dataset falls into it.

My other dataset has a position value. I want to check whether it fits in any of the ranges above on the matching chrom, label it as corresponding to that range, and then group positions in the same range and chrom together:

Gene   chrom position 
Gene1   1    1200          
Gene2   1    10000        
Gene3   5    500 
Gene4   5    560
Gene5   1    20100           

I've tried using group_by() and between() to set up the range, after seeing similar questions about date/time ranges, but I'm struggling to account for the need to match the chromosome (chrom) between the datasets before searching for the range.

Output would look like:

Gene   chrom position   Group 
Gene1   1    1200          1  #position is in one of the ranges and matches the chrom so is in a group    
Gene2   1    10000        NA  #does not fit into any range on chrom 1 (no matches)
Gene3   5    500           2  #position is in one of the ranges and matches the chrom so is in a group
Gene4   5    560           2  #position is in the same range and chrom as above so joins that group
Gene5   1    20100         3  #position matches a chrom and range and so gets a group corresponding to that particular chrom and range

  • Gene3 and Gene4 are not in group 1 because they are on a different chrom, but they do match the chrom and are within the range of the 3rd line of my first dataset, so they get to be in the group that corresponds to that range and chrom.
  • Gene5 is not in the same group as Gene1: although they match on chrom, they fall in different low/high ranges, so each gets the group for its own range.

So I am creating a Group column with a shared number for all rows in the same range between low and high on the same chrom, or NA if their position doesn't match in any range and chrom in the first dataset.

Input data:

df1 <- 
structure(list(chrom = c(1L, 1L, 5L), 
   low = c(500L, 19500L, 400L), high = c(1700L, 20600L, 1500L
    )), row.names = c(NA, -3L), class = c("data.table", "data.frame"))

df2 <- 
structure(list(Gene = c("Gene1", "Gene2", "Gene3", "Gene4", "Gene5"
), chrom = c(1L, 1L, 5L, 5L, 1L), position = c(1200L, 10000L, 
500L, 560L, 20100L)), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

I'm also looking into giving my first dataset a unique identifier for each unique range and chrom combination, and then assigning that identifier to any row in dataset 2 that matches the combination, so that the identifier becomes my group number column. My real data is 2.3k rows of ranges and 82k rows to match into shared groups, so I'm also having problems running the dplyr options I would normally try.

Solution

You could use a non-equi join in data.table:

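library(data.table)

# Convert to data.table by reference
setDT(df1)
setDT(df2)

# One group id per row of df1, i.e. per (chrom, low, high) range
df1[, group := seq_len(.N)]

# Non-equi join: same chrom, and position strictly between low and high
df1[df2, on = .(chrom, low < position, high > position)]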


   chrom   low  high group  Gene
1:     1  1200  1200     1 Gene1
2:     1 10000 10000    NA Gene2
3:     5   500   500     3 Gene3
4:     5   560   560     3 Gene4
5:     1 20100 20100     2 Gene5

Here I first set a group for each line of df1. After the merge, each line of df2 is associated with a group if the condition is met, and gets NA otherwise.

Non-equi merges are not super intuitive, but they are super powerful and explicit: the merging condition .(chrom, low < position, high > position) is literally what you asked for (you want the same chromosome, and a position between low and high).

In data.table, when you do

df1[df2,on = something]

you subset df1 with the lines of df2 meeting the condition expressed by on. If something is just a common variable of df1 and df2, then it is equivalent to

merge(df1, df2, all.y = TRUE, by = "something")
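
As a rough check of that equivalence, the sketch below uses two small made-up tables (x, i and their columns are illustrative names, not part of the question's data). The two results should hold the same rows, although merge() sorts by the key while x[i] keeps the row order of i.

```r
library(data.table)

# Hypothetical toy tables, only to illustrate the equi-join case
x <- data.table(id = c(1L, 2L, 3L), xval = c("a", "b", "c"))
i <- data.table(id = c(3L, 1L, 4L), ival = c(30, 10, 40))

# Join: one row per row of i, with NA in xval where x has no matching id
joined <- x[i, on = "id"]

# Same rows via merge(), keeping every row of the second table
merged <- merge(x, i, all.y = TRUE, by = "id")

print(joined)
print(merged)
```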

But something can also be a list of variables and conditions between the variables of your two data.tables. Here, .() indicates a list, and .(chrom, low < position, high > position) indicates that you merge on the variable chrom (identical between the two data.tables), and on low < position and high > position. When you express an inequality, you must start with the variable from the main data.table (df1 here), followed by the variable of the subsetting data.table (df2).

In the output of this non-equi merge, the variables of the main data.table (df1) that appear in the inequalities are replaced by the values of the variable from the subsetting data.table (df2), so low and high both show position. If you want to keep the low and high values, you should copy them into another variable, or merge on a copy of these variables.

You can also do the opposite merge, where you subset df2 by df1 entries, with the same condition:
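
A minimal sketch of the "merge on a copy" idea, assuming the question's input data rebuilt with data.table(). The column names low_key and high_key are made up for this illustration; they are only used as join keys, so the original low and high survive the join.

```r
library(data.table)

# The question's input data, rebuilt directly as data.tables
df1 <- data.table(chrom = c(1L, 1L, 5L),
                  low   = c(500L, 19500L, 400L),
                  high  = c(1700L, 20600L, 1500L))
df2 <- data.table(Gene     = paste0("Gene", 1:5),
                  chrom    = c(1L, 1L, 5L, 5L, 1L),
                  position = c(1200L, 10000L, 500L, 560L, 20100L))

df1[, group := .I]

# Hypothetical helper columns: copies of low/high used only in `on`
df1[, `:=`(low_key = low, high_key = high)]

res <- df1[df2, on = .(chrom, low_key < position, high_key > position)]

# After the join, low_key and high_key both hold df2's position,
# while low and high keep the matched range (NA when nothing matched)
res <- res[, .(Gene, chrom, position = low_key, low, high, group)]
print(res)
```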

df2[df1, on = .(chrom, position > low, position < high)]

    Gene chrom position position.1 group
1: Gene1     1      500       1700     1
2: Gene5     1    19500      20600     2
3: Gene3     5      400       1500     3
4: Gene4     5      400       1500     3

Here you subset df2 with the entries of df1 meeting the conditions expressed in on = .(), and obtain the list of Gene values that actually belong to a group (Gene2 is not here because it does not fall in any range).

As explained above, the join columns take the values from the other table, so here position and position.1 hold low and high.


Edit

I just saw @DavidArenburg 's comment, and it is a more condensed and better version of what I proposed and explained:

df2[, grp := df1[.SD, which = TRUE, on = .(chrom, low <= position, high >= position)]]

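This assigns the result of the non-equi merge directly to the group variable. With which = TRUE, the join returns, for each line of df2 (passed as .SD), the row number of df1 that it matches, or NA when there is no match. Since each line of df1 is one range, that row number is the group. Note that this version uses <= and >=, so positions equal to low or high are included, unlike the strict < and > used above.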
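
For reference, here is that one-liner run end to end on the question's input data (rebuilt with data.table() for brevity). This sketch assumes the ranges on a given chromosome do not overlap; if they did, a position could match several lines of df1 and the returned vector would no longer line up one-to-one with df2.

```r
library(data.table)

df1 <- data.table(chrom = c(1L, 1L, 5L),
                  low   = c(500L, 19500L, 400L),
                  high  = c(1700L, 20600L, 1500L))
df2 <- data.table(Gene     = paste0("Gene", 1:5),
                  chrom    = c(1L, 1L, 5L, 5L, 1L),
                  position = c(1200L, 10000L, 500L, 560L, 20100L))

# which = TRUE returns the matching row number of df1 for each row of df2
df2[, grp := df1[.SD, which = TRUE,
                 on = .(chrom, low <= position, high >= position)]]

print(df2)
# grp is 1, NA, 3, 3, 2: the row of df1 whose range contains each position
```

The group numbers follow the row order of df1, so they differ from the numbering in the question's desired output, but rows sharing a range and chrom still share a group.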
