Match with an interval and extract values between two matrices in R


Problem description

I have n matrices in a list, plus one additional matrix containing the values I want to find in that list of matrices.

To build the list of matrices, I use this code:

setwd("C:\\~\\Documents\\R") 


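import.multiple.txt.files <- function(pattern = "\\.txt$", header = TRUE) {
  # List every file in the working directory whose name matches `pattern`
  list.1 <- list.files(pattern = pattern)
  # Read each tab-delimited file into a data frame
  list.2 <- lapply(list.1, read.delim, header = header)
  names(list.2) <- list.1
  list.2
}

# Assumption: the original post never shows how txt.import is created;
# presumably it is the result of calling the import function above.
txt.import <- import.multiple.txt.files()
txt.import.matrix <- cbind(txt.import)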

My list looks like this (I show only an example with n = 2). The number of rows differs from one array to the next. Here I use just 5 and 6 rows to keep things simple, but my real data have more than 500 rows.

txt.import.matrix[1]

    [[1]]
     X.     RT.     Area.  m.z.      
1     1     1.01   2820.1  358.9777  
2     2     1.03   9571.8  368.4238  
3     3     2.03   6674.0  284.3294  
4     4     2.03   5856.3  922.0094  
5     5     3.03   27814.6 261.1299  


txt.import.matrix[2]

    [[2]]
     X.     RT.     Area.  m.z.      
1     1     1.01    7820.1 358.9777  
2     2     1.06    8271.8 368.4238  
3     3     2.03   12674.0 284.3294  
4     4     2.03    5856.6 922.0096  
5     5     2.03   17814.6 261.1299
6     6     3.65    5546.5 528.6475  

I have another array of values that I want to find in the list of matrices. This array was obtained by combining all the arrays in the list into a single array and removing the duplicates.

reduced.list.pre.filtering

     RT.   m.z.
1    1.01  358.9777
2    1.07  368.4238
3    2.05  284.3295
4    2.03  922.0092
5    3.03  261.1299
6    3.56  869.4558

I would like to obtain a new matrix containing, for every matrix in the list, the Area. value of the row that matches on RT. ± 0.02 and m.z. ± 0.0002. The output could look like this:

     RT.   m.z.        Area.[1]      Area.[2]
1    1.01  358.9777    2820.1        7820.1
2    1.07  368.4238                  8271.8      
3    2.05  284.3295    6674.0        12674.0
4    2.03  922.0092    5856.3             
5    3.03  261.1299    27814.6            
6    3.65  528.6475    

I only know how to match a single exact value in a single array. The difficulty here is that the value has to be found in a list of arrays, and that it has to be matched within ± an interval rather than exactly. Any suggestion would be much appreciated.

Recommended answer

This is an alternative approach to Arun's rather elegant answer using data.table. I decided to post it because it addresses two additional aspects that are important considerations in your problem:

  1. Floating point comparison: testing whether a floating point value lies in an interval requires accounting for the round-off error in computing the interval. This is the general problem of comparing floating point representations of real numbers. The following implements this comparison in the function in.interval.

  2. Multiple matches: your interval match criterion can result in multiple matches if the intervals overlap. The following assumes that you only want the first match (with respect to increasing rows of each matrix in txt.import.matrix). This is implemented in the function match.interval and explained in the notes that follow. Other logic is needed if you want something like the average of the areas that match your criterion.
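
To illustrate the first point, here is a minimal sketch using the in.interval definition given further down in this answer. The test values (1.03, 1.04 and 1.01) are chosen only for illustration and are not taken from a real data set.

```r
# in.interval as defined in this answer: widen the interval by a small tolerance
in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  abs(x - center) <= (deviation + tol)
}

# On paper |1.03 - 1.01| is exactly 0.02, but in binary floating point the
# computed difference can land slightly above the stored value of 0.02.
naive    <- abs(1.03 - 1.01) <= 0.02        # may be FALSE because of round-off
tolerant <- in.interval(1.03, 1.01, 0.02)   # TRUE: the tolerance absorbs the error
outside  <- in.interval(1.04, 1.01, 0.02)   # FALSE: 0.03 is genuinely outside

print(c(naive = naive, tolerant = tolerant, outside = outside))
```

The default tolerance, sqrt(.Machine$double.eps), is roughly 1.5e-8, which is far smaller than either interval width used here (0.02 and 0.0002), so it should not admit values that are meaningfully outside the interval.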

To find the matching row(s) in a matrix from txt.import.matrix for each row of reduced.list.pre.filtering, the following code vectorizes the comparison function over all enumerated pairs of rows between reduced.list.pre.filtering and that matrix. Functionally, for this application, this is the same as Arun's solution using data.table's non-equi joins. However, the non-equi join feature is more general, and the data.table implementation is most likely better optimized for both memory usage and speed, even for this application.

in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}

match.interval <- function(r, t) {
  r.rt <- rep(r[,1], each=nrow(t))
  t.rt <- rep(t[,2], times=nrow(r))
  r.mz <- rep(r[,2], each=nrow(t))
  t.mz <- rep(t[,4], times=nrow(r))                                       ## 1.

  ind <- which(in.interval(r.rt, t.rt, 0.02) & 
               in.interval(r.mz, t.mz, 0.0002))
  r.ind <- floor((ind - 1)/nrow(t)) + 1                                   ## 2.

  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t)                                ## 3.
  return(cbind(r.ind,t.ind))                       
}

get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3]                              ## 4.
  return(area)
}

res <- cbind(reduced.list.pre.filtering,
             do.call(cbind,lapply(txt.import.matrix, 
                                  get.area.matched, 
                                  r=reduced.list.pre.filtering)))         ## 5.
colnames(res) <- c(colnames(reduced.list.pre.filtering), 
                   sapply(seq_len(length(txt.import.matrix)), 
                          function(i) {return(paste0("Area.[",i,"]"))}))  ## 6.
print(res)
##      RT.     m.z. Area.[1] Area.[2]
##[1,] 1.01 358.9777   2820.1   7820.1
##[2,] 1.07 368.4238       NA   8271.8
##[3,] 2.05 284.3295   6674.0  12674.0
##[4,] 2.03 922.0092   5856.3       NA
##[5,] 3.03 261.1299  27814.6       NA
##[6,] 3.56 869.4558       NA       NA

Notes:

  1. This part constructs the data needed to vectorize the comparison function over all enumerated pairs of rows between reduced.list.pre.filtering and a matrix from txt.import.matrix. Four arrays are constructed. Two are replications (expansions) of the two columns of reduced.list.pre.filtering used in the comparison criterion, replicated along the row dimension of the matrix from txt.import.matrix. The other two are replications of the two corresponding columns of the matrix from txt.import.matrix, replicated along the row dimension of reduced.list.pre.filtering. Here, the term array refers to either a 2-D matrix or a 1-D vector. The resulting four arrays are:

  • r.rt is the replication of the RT. column of reduced.list.pre.filtering (i.e., r[,1]) in the row dimension of t
  • t.rt is the replication of the RT. column of the matrix from txt.import.matrix (i.e., t[,2]) in the row dimension of r
  • r.mz is the replication of the m.z. column of reduced.list.pre.filtering (i.e., r[,2]) in the row dimension of t
  • t.mz is the replication of the m.z. column of the matrix from txt.import.matrix (i.e., t[,4]) in the row dimension of r

What is important is that the indices of each of these arrays enumerate all pairs of rows in r and t in the same manner. Specifically, viewing these arrays as 2-D matrices of size M by N, where M=nrow(t) and N=nrow(r), the row indices correspond to the rows of t and the column indices correspond to the rows of r. Consequently, the values at the i-th row and j-th column of the four arrays are exactly the values used in the comparison criterion between the j-th row of r and the i-th row of t.

The replication is implemented with the R function rep. In computing r.rt, rep with each=M is used, which has the effect of treating its input r[,1] as a row vector and replicating that row M times to form M rows. As a result, each column of r.rt, which corresponds to a row of r, holds the RT. value from that row of r, and that value is the same in every row of the column, each of which corresponds to a row of t. This means that when that row of r is compared to any row of t, the RT. value from that row of r is used.

Conversely, in computing t.rt, rep with times=N is used, which has the effect of treating its input as a column vector and replicating that column N times to form N columns. As a result, each row of t.rt, which corresponds to a row of t, holds the RT. value from that row of t, and that value is the same in every column of the row, each of which corresponds to a row of r. This means that when that row of t is compared to any row of r, the RT. value from that row of t is used. The computations of r.mz and t.mz follow in the same way, using the m.z. column of r and t, respectively.
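
The layout described above can be checked on a toy case. The following is only a sketch: the vectors r.col and t.col are hypothetical stand-ins for r[,1] and t[,2], and the reshaping with matrix() is added purely to make the M by N view visible (the answer's code keeps the data as plain vectors).

```r
# Toy inputs (hypothetical values): 3 rows in r, 2 rows in t
r.col <- c(10, 20, 30)   # stands in for r[, 1]
t.col <- c(1, 2)         # stands in for t[, 2]
M <- length(t.col)       # plays the role of nrow(t)
N <- length(r.col)       # plays the role of nrow(r)

r.rep <- rep(r.col, each = M)    # 10 10 20 20 30 30
t.rep <- rep(t.col, times = N)   # 1 2 1 2 1 2

# Viewed as M x N matrices (R fills matrices column by column):
# column j carries the value from row j of r,
# row i carries the value from row i of t.
r.mat <- matrix(r.rep, nrow = M)
t.mat <- matrix(t.rep, nrow = M)
print(r.mat)
print(t.mat)
```

Because both vectors are laid out in the same order, element k of r.rep and element k of t.rep always belong to the same (t row, r row) pair, which is what lets a single elementwise comparison cover every pair.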

  2. This performs the vectorized comparison, resulting in an M by N logical matrix whose element at the i-th row and j-th column is TRUE if the j-th row of r matches the i-th row of t under the criterion, and FALSE otherwise. The output of which() is the vector of linear (column-major) indices into this logical result matrix at which the element is TRUE. We want to convert these linear indices into row and column indices of the result matrix so that we can refer back to the rows of r and t. The next line extracts the column indices from the linear indices. The variable is named r.ind to indicate that these correspond to rows of r. We extract this first because it is needed to detect multiple matches for a row of r.
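
As a small sketch of this index conversion, consider a hypothetical 2 by 3 comparison result (the matrix hits below is made up for illustration). The last line uses which() with arr.ind = TRUE, a base R option that performs the same conversion directly; it is shown only as a cross-check and is not what the answer's code uses.

```r
# Hypothetical 2 x 3 comparison result: rows index t (M = 2), columns index r (N = 3)
M <- 2
hits <- matrix(c(TRUE,  FALSE,   # column 1: r row 1 matches t row 1
                 FALSE, FALSE,   # column 2: r row 2 matches nothing
                 TRUE,  TRUE),   # column 3: r row 3 matches t rows 1 and 2
               nrow = M)

ind   <- which(hits)                  # linear, column-major indices: 1 5 6
r.ind <- floor((ind - 1) / M) + 1     # column index, i.e. row of r: 1 3 3
t.ind <- ind - (r.ind - 1) * M        # row index, i.e. row of t:    1 1 2

# Cross-check with base R's built-in conversion
rc <- which(hits, arr.ind = TRUE)     # matrix with columns "row" and "col"
print(cbind(r.ind, t.ind))
print(rc)
```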

  3. This part handles possible multiple matches in t for a given row of r. Multiple matches show up as duplicate values in r.ind. As stated above, the logic here keeps only the first match in terms of increasing rows of t. The function duplicated returns a logical vector that flags every element repeating an earlier value, so the first occurrence is left unflagged. Because which() returns its indices in increasing order, the first occurrence for a given row of r is the match with the smallest row index in t, and removing the flagged elements does what we want. The code first removes them from r.ind, then removes them from ind, and finally uses the pruned ind and r.ind to compute the row indices into the comparison result matrix, which correspond to the rows of t. What match.interval returns is a matrix in which each row is a matched pair of row indices: the first column holds row indices into r and the second column holds row indices into t.
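
A minimal sketch of the first-match rule, using hypothetical index values in which row 3 of r matched rows 1 and 2 of t (with M = 2 rows in t):

```r
# Hypothetical linear indices returned by which() for a 2 x 3 result matrix
M     <- 2
ind   <- c(1, 5, 6)
r.ind <- floor((ind - 1) / M) + 1          # 1 3 3: r row 3 appears twice

# duplicated() flags repeats of an earlier value, never the first occurrence
dup <- duplicated(r.ind)                   # FALSE FALSE TRUE

r.keep <- r.ind[!dup]                      # 1 3
t.keep <- ind[!dup] - (r.keep - 1) * M     # 1 1: the lowest matching t row wins

pairs <- cbind(r.ind = r.keep, t.ind = t.keep)
print(pairs)
```

Keeping the last match instead would be a matter of calling duplicated(r.ind, fromLast = TRUE); that variant is mentioned only as an option and is not part of the original answer.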

  4. The get.area.matched function simply uses the result of match.interval (stored in match.ind) to extract the Area. value from t for all matches. Note that the returned result is a (column) vector whose length equals the number of rows in r and which is initialized to NA. In this way, rows of r that have no match in t get a returned Area. of NA.

  5. This uses lapply to apply the function get.area.matched over the list txt.import.matrix and appends the returned matched Area. results to reduced.list.pre.filtering as column vectors. The appropriate column names are then appended and set in the result res (the line marked ## 6.).
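
The code above keeps only the first match. If the mean area over all matching rows were wanted instead, as mentioned at the start of this answer, one possible sketch is shown below. The function get.area.mean, its parameters rt.dev and mz.dev, and the toy matrices are all hypothetical and not part of the original answer; the column layout (X., RT., Area., m.z.) follows the question.

```r
in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  abs(x - center) <= (deviation + tol)
}

# Hypothetical variant of get.area.matched: for each row of r, average the
# Area. (column 3) of every row of t that matches on RT. and m.z.
get.area.mean <- function(r, t, rt.dev = 0.02, mz.dev = 0.0002) {
  vapply(seq_len(nrow(r)), function(j) {
    hit <- in.interval(t[, 2], r[j, 1], rt.dev) &
           in.interval(t[, 4], r[j, 2], mz.dev)
    if (any(hit)) mean(t[hit, 3]) else NA_real_
  }, numeric(1))
}

# Toy data: the first two rows of t both match the first row of r
t <- cbind(X.    = 1:3,
           RT.   = c(1.01, 1.02, 2.03),
           Area. = c(100, 300, 50),
           m.z.  = c(358.9777, 358.9778, 284.3294))
r <- cbind(RT. = c(1.01, 3.56), m.z. = c(358.9777, 869.4558))

area.mean <- get.area.mean(r, t)
print(area.mean)   # first entry is the mean of 100 and 300, second is NA
```

This version loops over the rows of r rather than expanding all pairs up front, which trades some speed for lower memory use and simpler handling of multiple matches.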

Edit: Alternative implementation using the foreach package

In hindsight, a better implementation uses the foreach package to express the comparison over all pairs of rows. This implementation requires the foreach and magrittr packages:

require("magrittr")  ## for %>%
require("foreach")

Then the code in match.interval that vectorizes the comparison

r.rt <- rep(r[,1], each=nrow(t))
t.rt <- rep(t[,2], times=nrow(r))
r.mz <- rep(r[,2], each=nrow(t))
t.mz <- rep(t[,4], times=nrow(r))                       # 1.

ind <- which(in.interval(r.rt, t.rt, 0.02) & 
             in.interval(r.mz, t.mz, 0.0002))

can be replaced by

ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:% 
         foreach(t.row = 1:nrow(t)) %do% 
           match.criterion(r.row, t.row, r, t) %>% 
             as.logical(.) %>% which(.)

where match.criterion is defined as

match.criterion <- function(r.row, t.row, r, t) {
  return(in.interval(r[r.row,1], t[t.row,2], 0.02) & 
         in.interval(r[r.row,2], t[t.row,4], 0.0002))
}

This is easier to parse and better reflects what is being performed. Note that what the nested foreach returns, once combined with cbind, is again a logical matrix. Finally, the application of the get.area.matched function over the list txt.import.matrix can also be performed using foreach:

res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do% 
         get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
           cbind(reduced.list.pre.filtering,.)

The complete code using foreach is as follows:

require("magrittr")
require("foreach")

in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}

match.criterion <- function(r.row, t.row, r, t) {
  return(in.interval(r[r.row,1], t[t.row,2], 0.02) & 
     in.interval(r[r.row,2], t[t.row,4], 0.0002))
}

match.interval <- function(r, t) {
  ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:% 
       foreach(t.row = 1:nrow(t)) %do% 
     match.criterion(r.row, t.row, r, t) %>% 
       as.logical(.) %>% which(.)
  # which returns 1-D linear indices (column-major in R),
  # convert these to 2-D indices in (row,col)
  r.ind <- floor((ind - 1)/nrow(t)) + 1                   ## 2.
  # detect duplicates in r.ind and remove them from ind
  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t)                ## 3.

  return(cbind(r.ind,t.ind))                       
}

get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3]
  return(area)
}

res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do% 
     get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
       cbind(reduced.list.pre.filtering,.)

colnames(res) <- c(colnames(reduced.list.pre.filtering), 
           sapply(seq_len(length(txt.import.matrix)), 
              function(i) {return(paste0("Area.[",i,"]"))}))

Hope this helps.
