How to count matches between a vector and dataframe of sequence coordinates?

Question

Given a data table with start and end coordinates for sequences of integers:

library(data.table)

set.seed(1)

df1 <- data.table(
  START = c(seq(1, 10000000, 10), seq(1, 10000000, 10), seq(1, 10000000, 10)),
  END = c(seq(10, 10000000, 10), seq(10, 10000000, 10), seq(10, 10000000, 10))
)

and a vector of integers:

vec1 <- sample(1:100000, 10000)

How can I count the number of integers in vec1 that are within the start and end coordinates of each sequence in df1? I am currently using a for loop:

COUNT <- rep(NA, nrow(df1)) 
for (i in 1:nrow(df1)){
  vec2 <- seq(from = df1$START[i], to = df1$END[i])
  COUNT[i] <- table(vec2 %in% vec1)[2]
  print(i)
}
df1$COUNT <- COUNT

However, the data table and vector I am applying this to are very large. Can anyone suggest a way to improve performance?

Any help would be much appreciated!

Recommended answer

library(data.table)

### Example data:
# df1 <- data.table(START = c(1, 8, 11), END = c(4, 9, 30))
# vec1 <- c(3, 2, 8)

df1[, ind := .I] # add a unique row index to the data.table
dt2 <- as.data.table(vec1, key = 'vec1') # convert the vector to a data.table
dt2[, vec2 := vec1] # duplicate the column so each value becomes a point interval [vec1, vec2]
setkey(df1) # set keys on all columns (sorts the data by START, END, ind)
# Fast overlap join:
ans1 = foverlaps(dt2, df1, by.x = c('vec1', 'vec2'), by.y = c('START', 'END'),
                 type = "within", nomatch = 0L)

counts <- ans1[, .N, keyby = ind] # count the matches per ind
# merge the counts back into the initial data
df1[, COUNT := counts[df1, on = .(ind), x.N]]
df1

setorder(df1, ind) # reorder by ind to restore the initial order
df1[, ind := NULL] # delete the ind column
df1[is.na(COUNT), COUNT := 0L] # NA means a count of 0
df1
#    START END COUNT
# 1:     1   4     2
# 2:     8   9     1
# 3:    11  30     0
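
As an aside, the same counts can probably be obtained in base R without a join, by sorting the vector once and using findInterval() to count how many values lie at or below each END and strictly below each START. This is only a sketch of an alternative technique, not the method from the answer above; count_in_intervals is a hypothetical helper name, and it assumes closed intervals and that duplicated values in the vector should be counted once (as the %in% loop in the question does).

```r
# Hypothetical helper (not part of the original answer): for each closed
# interval [start, end], count how many distinct elements of `values` lie in it.
count_in_intervals <- function(start, end, values) {
  s <- sort(unique(values))  # findInterval() requires a sorted vector
  n_le_end   <- findInterval(end, s)                      # values <= end
  n_lt_start <- findInterval(start, s, left.open = TRUE)  # values <  start
  n_le_end - n_lt_start
}

# Same toy data as in the answer above
START <- c(1, 8, 11)
END   <- c(4, 9, 30)
vec1  <- c(3, 2, 8)

COUNT <- count_in_intervals(START, END, vec1)
print(data.frame(START, END, COUNT))
```

Because both findInterval() calls are vectorised over the interval endpoints, the cost is one sort of the vector plus a binary search per row, which should scale to the data sizes in the question. Drop the unique() call if repeated values should be counted every time they occur.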
