Faster method than "while" loop to find chain of infection in R

Question

I'm analyzing large tables (300,000-500,000 rows) that store data output by a disease simulation model. In the model, animals on a landscape infect other animals. In the example pictured below, animal a1 infects every animal on the landscape, and the infection moves from animal to animal, branching off into "chains" of infection.

In my example below, I want to take the table that stores information about each animal (table = allanimals) and slice out just the rows for animal d2's chain of infection (I've highlighted d2's chain in green) so I can calculate the average habitat value for that chain.

Although my while loop works, it is slow as molasses when the table stores hundreds of thousands of rows and the chain has 40-100 members.

Any ideas on how to speed this up? Hoping for a tidyverse solution. I know it "looks fast enough" with my example dataset, but it really is slow with my data...

Diagram: (image not included)

Expected output for the example data below:

   AnimalID InfectingAnimal habitat
1        d2              d1       1
2        d1              c3       1
3        c3              c2       3
4        c2              c1       2
5        c1              b3       3
6        b3              b2       6
7        b2              b1       5
8        b1              a2       4
9        a2              a1       2
10       a1               x       1

Example code:

library(tidyverse)

# make some data
allanimals <- structure(list(AnimalID = c("a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8",
"b1", "b2", "b3", "b4", "b5", "c1", "c2", "c3", "c4", "d1", "d2", "e1", "e2",
"e3", "e4", "e5", "e6", "f1", "f2", "f3", "f4", "f5", "f6", "f7"),
InfectingAnimal = c("x", "a1", "a2", "a3", "a4", "a5", "a6", "a7", "a2", "b1",
"b2", "b3", "b4", "b3", "c1", "c2", "c3", "c3", "d1", "b1", "e1", "e2", "e3",
"e4", "e5", "e1", "f1", "f2", "f3", "f4", "f5", "f6"), habitat = c(1L, 2L, 1L,
2L, 2L, 1L, 3L, 2L, 4L, 5L, 6L, 1L, 2L, 3L, 2L, 3L, 2L, 1L, 1L, 2L, 5L, 4L,
1L, 1L, 1L, 1L, 4L, 5L, 4L, 5L, 4L, 3L)), .Names = c("AnimalID",
"InfectingAnimal", "habitat"), class = "data.frame", row.names = c(NA, -32L))

# check it out
head(allanimals)

# Start with animal I'm interested in - say, d2
Focal.Animal <- "d2"

# Make a 1-row data.frame with d2's information
Focal.Animal <- allanimals %>% 
  filter(AnimalID == Focal.Animal)

# This is the animal we start with
Focal.Animal

# Make a new data.frame to store our results of the while loop in
Chain <- Focal.Animal

# make a condition to help while loop
InfectingAnimalInTable <- TRUE

# time it 
ptm <- proc.time()

# Run loop until you find an animal that isn't in the table, then stop
while(InfectingAnimalInTable == TRUE){
    # Who is the next infecting animal?
    NextAnimal <- Chain %>% 
      slice(n()) %>% 
      select(InfectingAnimal) %>% 
      unlist()

    NextRow <- allanimals %>% 
      filter(AnimalID == NextAnimal)


    # If there is an infecting animal in the table, 
    if (nrow(NextRow) > 0) {
      # Add this to the Chain table
      Chain[(nrow(Chain)+1),] <- NextRow
      # Otherwise, the infecting animal is not in the table:
      # flip the flag so the loop stops.
    } else {InfectingAnimalInTable <- FALSE}
  }

proc.time() - ptm

# did it work? Check out the Chain data.frame
Chain

Recommended answer

So the problem here is with your data structure. You need a vector that stores who was infected by whom, with each animal represented by an integer position rather than a string, so that every step of the chain is a single index lookup instead of a filter over the whole table:

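# All IDs that appear anywhere; AnimalIDs come first, so (as long as AnimalID
# is unique) an ID's position here is also its row number in allanimals
allanimals_ID <- unique(c(allanimals$AnimalID, allanimals$InfectingAnimal))

# infected[i] = position of the animal that infected animal i (NA if unknown)
infected <- rep(NA_integer_, length(allanimals_ID))
infected[match(allanimals$AnimalID, allanimals_ID)] <-
  match(allanimals$InfectingAnimal, allanimals_ID)

# Walk from d2 back towards the source, recording each position visited
path <- rep(NA_integer_, length(allanimals_ID))
curOne <- match("d2", allanimals_ID)
i <- 1
while (!is.na(nextOne <- infected[curOne])) {
  path[i] <- curOne
  i <- i + 1
  curOne <- nextOne
}

# Rows of the chain, from d2 back to a1
allanimals[path[seq_len(i - 1)], ]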
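
To get straight to the average habitat value the question asks for, the same idea can be wrapped in a function. This is only a sketch, not part of the original answer: `trace_chain` and the `toy` data frame are names made up for illustration (`toy` is a trimmed copy of the question's example data), and it assumes `AnimalID` is unique. It is a slight variation on the loop above: it matches `InfectingAnimal` directly against `AnimalID`, so an infector that is not in the table (like "x") becomes NA and ends the walk.

```r
# Trimmed copy of the question's example data (illustration only)
toy <- data.frame(
  AnimalID        = c("a1", "a2", "b1", "b2", "b3", "c1", "c2", "c3", "d1", "d2", "e1"),
  InfectingAnimal = c("x",  "a1", "a2", "b1", "b2", "b3", "c1", "c2", "c3", "d1", "b1"),
  habitat         = c(1L, 2L, 4L, 5L, 6L, 3L, 2L, 3L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the rows of `df` on the infection chain that
# starts at `focal`, ordered from `focal` back to the first infected animal.
trace_chain <- function(df, focal) {
  # Assumption: each AnimalID appears exactly once
  stopifnot(anyDuplicated(df$AnimalID) == 0)
  # infector_row[i] = row of the animal that infected row i,
  # or NA when that animal is not in the table
  infector_row <- match(df$InfectingAnimal, df$AnimalID)
  path <- integer(nrow(df))
  n <- 0L
  cur <- match(focal, df$AnimalID)
  # The n < nrow(df) bound guards against an accidental cycle in the data
  while (!is.na(cur) && n < nrow(df)) {
    n <- n + 1L
    path[n] <- cur
    cur <- infector_row[cur]
  }
  df[path[seq_len(n)], , drop = FALSE]
}

chain <- trace_chain(toy, "d2")
chain$AnimalID       # d2, d1, c3, c2, c1, b3, b2, b1, a2, a1
mean(chain$habitat)  # 2.8
```

The lookup vector is built with a single `match()` call, so each step of the walk is one integer index rather than a `filter()` over every row of the table.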

For an extra performance gain, recode this loop with Rcpp :')
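
A rough idea of what that could look like, assuming the Rcpp package and a working C++ compiler are installed. `follow_chain` is a made-up name, and the small `infected` vector here is toy data laid out the same way as the `infected` vector above (position 5 plays the role of "x", which has no recorded infector).

```r
library(Rcpp)

# Hypothetical helper: starting at 1-based position `start`, follow the
# integer `infected` vector until an NA is reached, and return the
# positions visited (the final position, whose infector is NA, is not included).
cppFunction('
IntegerVector follow_chain(IntegerVector infected, int start) {
  int n = infected.size();
  std::vector<int> path;
  int cur = start;
  // Stop on an out-of-range position, or after n steps as a cycle guard
  while (cur >= 1 && cur <= n && (int)path.size() < n) {
    int nxt = infected[cur - 1];
    if (nxt == NA_INTEGER) break;  // no recorded infector: end of chain
    path.push_back(cur);
    cur = nxt;
  }
  return wrap(path);
}')

# Toy data: positions 1-4 are animals, position 5 is the unknown source
infected <- c(5L, 1L, 2L, 2L, NA_integer_)
result <- follow_chain(infected, 4L)
result  # 4 2 1
```

With the objects from the answer, `allanimals[follow_chain(infected, match("d2", allanimals_ID)), ]` should give the same rows as the R loop.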
