Match within an interval and extract values between two matrices in R
Question
I have n matrices in a list and an additional matrix that contains the values I want to find in that list of matrices.
To get the list of matrices, I use this code:
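setwd("C:\\~\\Documents\\R")
import.multiple.txt.files <- function(pattern=".txt", header=T)
{
  # list every file in the working directory whose name matches the pattern
  list.1 <- list.files(pattern=pattern)
  list.2 <- list()
  for (i in seq_along(list.1))
  {
    list.2[[i]] <- read.delim(list.1[i], header=header)
  }
  names(list.2) <- list.1
  list.2
}
# assumed: the call that created txt.import was left out of the post
txt.import <- import.multiple.txt.files()
txt.import.matrix <- cbind(txt.import)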
My list looks like this (I show only an example with n = 2). The number of rows in each array is different (here I use just 5 and 6 rows to simplify, but my real data have more than 500).
txt.import.matrix[1]
[[1]]
  X.  RT.   Area.     m.z.
1  1 1.01  2820.1 358.9777
2  2 1.03  9571.8 368.4238
3  3 2.03  6674.0 284.3294
4  4 2.03  5856.3 922.0094
5  5 3.03 27814.6 261.1299
txt.import.matrix[2]
[[2]]
  X.  RT.   Area.     m.z.
1  1 1.01  7820.1 358.9777
2  2 1.06  8271.8 368.4238
3  3 2.03 12674.0 284.3294
4  4 2.03  5856.6 922.0096
5  5 2.03 17814.6 261.1299
6  6 3.65  5546.5 528.6475
I have another array of values that I want to find in the list of matrices. This array was obtained by combining all the arrays from the list into one array and removing the duplicates.
reduced.list.pre.filtering
   RT.     m.z.
1 1.01 358.9777
2 1.07 368.4238
3 2.05 284.3295
4 2.03 922.0092
5 3.03 261.1299
6 3.56 869.4558
I would like to obtain a new matrix that holds, for every matrix in the list, the Area. value of the row that matches on RT. ± 0.02 and m.z. ± 0.0002. The output could look like this:
   RT.     m.z. Area.[1] Area.[2]
1 1.01 358.9777   2820.1   7820.1
2 1.07 368.4238            8271.8
3 2.05 284.3295   6674.0  12674.0
4 2.03 922.0092   5856.3
5 3.03 261.1299  27814.6
6 3.65 528.6475
I only know how to match a single exact value in a single array. The difficulty here is that the value has to be found in a list of arrays, and it has to match within ± an interval. If you have any suggestions, I will be very grateful.
Recommended answer
This is an alternative approach to Arun's rather elegant answer using data.table. I decided to post it because it contains two additional aspects that are important considerations in your problem:
- Floating point comparison: checking whether a floating point value lies in an interval requires accounting for the round-off error made in computing the interval. This is the general problem of comparing floating point representations of real numbers. See this and this in the context of R. The following implements this comparison in the function in.interval.
- Multiple matches: your interval match criterion can result in multiple matches if the intervals overlap. The following assumes that you only want the first match (with respect to increasing rows of each txt.import.matrix matrix). This is implemented in the function match.interval and explained in the notes that follow. Other logic is needed if you want something like the average of the areas that match your criterion.
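The floating point issue can be seen with two RT. values from the question. The following is a minimal sketch that reuses the in.interval definition from this answer; the variable names naive and tolerant are only for illustration.

```r
# in.interval as defined in this answer: TRUE when |x - center| <= deviation,
# with a small tolerance (sqrt of machine epsilon) to absorb round-off error.
in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  abs(x - center) <= (deviation + tol)
}

# Naive comparison: 2.05 and 2.03 are exactly 0.02 apart on paper, but the
# computed difference is slightly larger than the double closest to 0.02.
naive <- abs(2.05 - 2.03) <= 0.02

# Tolerant comparison using in.interval.
tolerant <- in.interval(2.05, 2.03, 0.02)

print(c(naive = naive, tolerant = tolerant))
```

With IEEE-754 doubles the naive test is FALSE while the tolerant test is TRUE, which is why row 3 of reduced.list.pre.filtering would be missed without the tolerance.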
To find the matching row(s) in a matrix from txt.import.matrix for each row in the matrix reduced.list.pre.filtering, the following code vectorizes the application of the comparison function over the space of all enumerated pairs of rows between reduced.list.pre.filtering and the matrix from txt.import.matrix. Functionally, for this application, this is the same as Arun's solution using data.table's non-equi joins; however, the non-equi join feature is more general, and the data.table implementation is most likely better optimized for both memory usage and speed, even for this application.
in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}

match.interval <- function(r, t) {
  r.rt <- rep(r[,1], each=nrow(t))
  t.rt <- rep(t[,2], times=nrow(r))
  r.mz <- rep(r[,2], each=nrow(t))
  t.mz <- rep(t[,4], times=nrow(r))                      ## 1.
  ind <- which(in.interval(r.rt, t.rt, 0.02) &
               in.interval(r.mz, t.mz, 0.0002))
  r.ind <- floor((ind - 1)/nrow(t)) + 1                  ## 2.
  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t)               ## 3.
  return(cbind(r.ind,t.ind))
}

get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3]             ## 4.
  return(area)
}

res <- cbind(reduced.list.pre.filtering,
             do.call(cbind,lapply(txt.import.matrix,
                                  get.area.matched,
                                  r=reduced.list.pre.filtering)))  ## 5.
colnames(res) <- c(colnames(reduced.list.pre.filtering),
                   sapply(seq_len(length(txt.import.matrix)),
                          function(i) {return(paste0("Area.[",i,"]"))}))  ## 6.
print(res)
##      RT.     m.z. Area.[1] Area.[2]
##[1,] 1.01 358.9777   2820.1   7820.1
##[2,] 1.07 368.4238       NA   8271.8
##[3,] 2.05 284.3295   6674.0  12674.0
##[4,] 2.03 922.0092   5856.3       NA
##[5,] 3.03 261.1299  27814.6       NA
##[6,] 3.56 869.4558       NA       NA
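To try the code without the original text files, the sample data from the question can be typed in by hand before running it. This is only a sketch: it assumes the list elements are data frames with the column order X., RT., Area., m.z. (as read.delim would produce), that reduced.list.pre.filtering is a numeric matrix, and the names m1 and m2 are introduced here purely for illustration.

```r
# Sample data transcribed from the question (assumed column order:
# X., RT., Area., m.z., matching the t[,2], t[,3], t[,4] indexing in the answer).
m1 <- data.frame(X.    = 1:5,
                 RT.   = c(1.01, 1.03, 2.03, 2.03, 3.03),
                 Area. = c(2820.1, 9571.8, 6674.0, 5856.3, 27814.6),
                 m.z.  = c(358.9777, 368.4238, 284.3294, 922.0094, 261.1299))
m2 <- data.frame(X.    = 1:6,
                 RT.   = c(1.01, 1.06, 2.03, 2.03, 2.03, 3.65),
                 Area. = c(7820.1, 8271.8, 12674.0, 5856.6, 17814.6, 5546.5),
                 m.z.  = c(358.9777, 368.4238, 284.3294, 922.0096, 261.1299, 528.6475))

# A plain list stands in for the imported list of files.
txt.import.matrix <- list(m1, m2)

# The values to look up: RT. in column 1, m.z. in column 2.
reduced.list.pre.filtering <- cbind(
  RT.  = c(1.01, 1.07, 2.05, 2.03, 3.03, 3.56),
  m.z. = c(358.9777, 368.4238, 284.3295, 922.0092, 261.1299, 869.4558))
```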
Notes:
1. This part constructs the data needed to vectorize the application of the comparison function over the space of all enumerated pairs of rows between reduced.list.pre.filtering and the matrix from txt.import.matrix. Four arrays are constructed: the two columns of reduced.list.pre.filtering used in the comparison criterion, replicated (expanded) along the row dimension of the matrix from txt.import.matrix, and the two columns of that matrix used in the comparison criterion, replicated along the row dimension of reduced.list.pre.filtering. Here, the term array refers to either a 2-D matrix or a 1-D vector. The resulting four arrays are:
- r.rt is the replication of the RT. column of reduced.list.pre.filtering (i.e., r[,1]) in the row dimension of t
- t.rt is the replication of the RT. column of the matrix from txt.import.matrix (i.e., t[,2]) in the row dimension of r
- r.mz is the replication of the m.z. column of reduced.list.pre.filtering (i.e., r[,2]) in the row dimension of t
- t.mz is the replication of the m.z. column of the matrix from txt.import.matrix (i.e., t[,4]) in the row dimension of r
What is important is that the indices of each of these arrays enumerate all pairs of rows in r and t in the same manner. Specifically, viewing these arrays as 2-D matrices of size M by N, where M=nrow(t) and N=nrow(r), the row indices correspond to the rows of t and the column indices correspond to the rows of r. Consequently, the values at the i-th row and the j-th column of the four arrays are the values used in the comparison criterion between the j-th row of r and the i-th row of t.
The replication is implemented with the R function rep. For example, in computing r.rt, rep with each=M is used, which has the effect of treating its input r[,1] as a row vector and replicating that row M times to form M rows. As a result, each column of r.rt, which corresponds to a row in r, holds the RT. value from that row of r, and that value is the same in every row of the column, each of which corresponds to a row in t. This means that in comparing that row of r to any row of t, the RT. value from that row of r is used. Conversely, in computing t.rt, rep with times=N is used, which has the effect of treating its input as a column vector and replicating that column N times to form N columns. As a result, each row of t.rt, which corresponds to a row in t, holds the RT. value from that row of t, and that value is the same in every column of the row, each of which corresponds to a row in r. This means that in comparing that row of t to any row of r, the RT. value from that row of t is used. The computations of r.mz and t.mz follow in the same way, using the m.z. column of r and t, respectively.
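The effect of each= versus times= is easiest to see on toy vectors. The sketch below uses made-up values (r.col and t.col are illustrative stand-ins for r[,1] and t[,2], not data from the question) and reshapes the replicated vectors into the M by N view described in this note.

```r
r.col <- c(10, 20, 30)  # stands in for r[,1]; N = 3 rows in r
t.col <- c(1, 2)        # stands in for t[,2]; M = 2 rows in t
M <- length(t.col)
N <- length(r.col)

r.rep <- rep(r.col, each  = M)  # 10 10 20 20 30 30
t.rep <- rep(t.col, times = N)  #  1  2  1  2  1  2

# Viewed as M-by-N matrices (R fills matrices column by column):
r.mat <- matrix(r.rep, nrow = M)  # column j is constant and equals r.col[j]
t.mat <- matrix(t.rep, nrow = M)  # row i is constant and equals t.col[i]

print(r.mat)
print(t.mat)
```

Position (i, j) of the two matrices therefore pairs t.col[i] with r.col[j], so an element-wise comparison of r.rep and t.rep covers every pair of rows exactly once.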
2. This performs the vectorized comparison, resulting in an M by N logical matrix whose element at the i-th row and the j-th column is TRUE if the j-th row of r matches the i-th row of t under the criterion, and FALSE otherwise. The output of which() is the array of linear (column-major) indices into this comparison result at which the element is TRUE. We want to convert these linear indices to the row and column indices of the comparison result matrix in order to refer back to the rows of r and t. The next line extracts the column indices from the linear indices. Note that the variable is named r.ind to denote that these indices correspond to the rows of r. We extract this first because it is needed to detect multiple matches for a row in r.
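The index arithmetic in this step can be checked on a small logical matrix. The sketch below uses a made-up 2 by 3 comparison result (cmp is illustrative, not taken from the question's data) and compares the hand-computed indices with base R's arrayInd, which performs the same conversion.

```r
M <- 2  # rows of t
N <- 3  # rows of r
cmp <- matrix(c(FALSE, TRUE,    # column 1: r row 1 matches t row 2
                FALSE, FALSE,   # column 2: r row 2 matches nothing
                TRUE,  TRUE),   # column 3: r row 3 matches t rows 1 and 2
              nrow = M)

ind   <- which(cmp)                # linear, column-major indices: 2 5 6
r.ind <- floor((ind - 1) / M) + 1  # column index, i.e. row of r: 1 3 3
t.ind <- ind - (r.ind - 1) * M     # row index, i.e. row of t:    2 1 2

# Base R equivalent of the two formulas above.
builtin <- arrayInd(ind, dim(cmp))
print(cbind(r.ind, t.ind))
print(builtin)
```

In match.interval the t.ind formula is applied only after duplicates are removed, but the arithmetic is the same.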
3. This part handles possible multiple matches in t for a given row in r. Multiple matches show up as duplicate values in r.ind. As stated above, the logic here keeps only the first match in terms of increasing rows in t. The function duplicated returns a logical vector that flags every element equal to an earlier element of the array, so removing the flagged elements keeps only the first match for each row of r. The code first removes them from r.ind, then removes them from ind, and finally computes the row indices into the comparison result matrix, which correspond to the rows of t, using the pruned ind and r.ind. What match.interval returns is a matrix whose rows are matched pairs of row indices, with the first column holding row indices into r and the second column holding row indices into t.
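A small made-up example shows what duplicated does here, and what a different policy might look like. The values of r.ind, t.ind and t.area below are illustrative only, and the tapply line is a hypothetical alternative (averaging over all matches) that is not part of the answer's code.

```r
r.ind <- c(1, 3, 3)  # row 3 of r matched two rows of t
t.ind <- c(2, 1, 2)

# Keep only the first match per row of r, as match.interval does.
dup   <- duplicated(r.ind)  # FALSE FALSE TRUE
first <- cbind(r.ind = r.ind[!dup], t.ind = t.ind[!dup])
print(first)

# Hypothetical alternative: average the area over all matches of each r row.
t.area    <- c(100, 200)  # illustrative Area. values for t rows 1 and 2
mean.area <- tapply(t.area[t.ind], r.ind, mean)
print(mean.area)
```

Because which() returns indices in increasing order and t varies fastest within each column, the first occurrence of a given r.ind value is always the match with the smallest row of t.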
4. The get.area.matched function simply uses the result in match.ind to extract the Area. value from t for all matches. Note that the returned result is a (column) vector whose length equals the number of rows in r and which is initialized to NA. In this way, rows in r that have no match in t get a returned Area. of NA.
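The NA-initialized assignment can be sketched in isolation. The index pairs and area values below are made up for illustration and only mimic the shape of what match.interval returns.

```r
n.r    <- 4            # number of rows in r (illustrative)
t.area <- c(100, 200)  # illustrative Area. values for two rows of t

# Shape of the match.interval result: one row per matched (r row, t row) pair.
match.ind <- cbind(r.ind = c(1, 3), t.ind = c(2, 1))

area <- rep(NA, n.r)                            # start with "no match" everywhere
area[match.ind[, 1]] <- t.area[match.ind[, 2]]  # fill in the matched rows only
print(area)  # 200 NA 100 NA
```

If there are no matches at all, match.ind has zero rows and the assignment leaves area as all NA.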
5. This uses lapply to apply the function get.area.matched over the list txt.import.matrix and appends the returned matched Area. results to reduced.list.pre.filtering as column vectors.
6. Similarly, the appropriate column names are appended and set in the result res.
Edit: Alternative implementation using the foreach package
In hindsight, a better implementation uses the foreach package for vectorizing the comparison. This implementation requires the foreach and magrittr packages:
require("magrittr") ## for %>%
require("foreach")
Then the code in match.interval for vectorizing the comparison
r.rt <- rep(r[,1], each=nrow(t))
t.rt <- rep(t[,2], times=nrow(r))
r.mz <- rep(r[,2], each=nrow(t))
t.mz <- rep(t[,4], times=nrow(r)) # 1.
ind <- which(in.interval(r.rt, t.rt, 0.02) &
in.interval(r.mz, t.mz, 0.0002))
can be replaced by
ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:%
foreach(t.row = 1:nrow(t)) %do%
match.criterion(r.row, t.row, r, t) %>%
as.logical(.) %>% which(.)
where match.criterion is defined as
match.criterion <- function(r.row, t.row, r, t) {
return(in.interval(r[r.row,1], t[t.row,2], 0.02) &
in.interval(r[r.row,2], t[t.row,4], 0.0002))
}
This is easier to parse and reflects what is being performed. Note that what the nested foreach combined with cbind returns is again a matrix of logical comparison results, which as.logical and which then reduce to the indices of the TRUE elements. Finally, the application of the get.area.matched function over the list txt.import.matrix can also be performed using foreach:
res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do%
get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
cbind(reduced.list.pre.filtering,.)
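For readers without the foreach package, the nested loop over row pairs can be approximated in base R with nested sapply calls. This is a swapped-in technique rather than the answer's own code; the small r and t below are a made-up subset of the question's data, and cmp is an illustrative name.

```r
in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  abs(x - center) <= (deviation + tol)
}
match.criterion <- function(r.row, t.row, r, t) {
  in.interval(r[r.row, 1], t[t.row, 2], 0.02) &
    in.interval(r[r.row, 2], t[t.row, 4], 0.0002)
}

# Illustrative subset: 2 lookup rows (r) and 3 data rows (t).
r <- cbind(RT. = c(1.01, 2.05), m.z. = c(358.9777, 284.3295))
t <- data.frame(X.    = 1:3,
                RT.   = c(1.01, 1.03, 2.03),
                Area. = c(2820.1, 9571.8, 6674.0),
                m.z.  = c(358.9777, 368.4238, 284.3294))

# Outer sapply runs over rows of r (columns of cmp),
# inner sapply over rows of t (rows of cmp).
cmp <- sapply(seq_len(nrow(r)), function(r.row)
  sapply(seq_len(nrow(t)), function(t.row)
    match.criterion(r.row, t.row, r, t)))

ind <- which(cmp)  # linear, column-major indices of the matches
print(cmp)
print(ind)
```

The resulting ind can be fed to the same r.ind and t.ind arithmetic used in match.interval, since cmp has nrow(t) rows and nrow(r) columns.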
The complete code using foreach is as follows:
require("magrittr")
require("foreach")

in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}

match.criterion <- function(r.row, t.row, r, t) {
  return(in.interval(r[r.row,1], t[t.row,2], 0.02) &
         in.interval(r[r.row,2], t[t.row,4], 0.0002))
}

match.interval <- function(r, t) {
  ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:%
           foreach(t.row = 1:nrow(t)) %do%
             match.criterion(r.row, t.row, r, t) %>%
               as.logical(.) %>% which(.)
  # which returns 1-D indices (column-major),
  # convert these to 2-D indices in (row,col)
  r.ind <- floor((ind - 1)/nrow(t)) + 1          ## 2.
  # detect duplicates in r.ind and remove them from ind
  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t)       ## 3.
  return(cbind(r.ind,t.ind))
}

get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3]
  return(area)
}

res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do%
         get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
           cbind(reduced.list.pre.filtering,.)
colnames(res) <- c(colnames(reduced.list.pre.filtering),
                   sapply(seq_len(length(txt.import.matrix)),
                          function(i) {return(paste0("Area.[",i,"]"))}))
Hope this helps.