Avoiding lapply() in R, and finding all elements of Vector B that meet a condition for each element of Vector A

Problem description

I have two vectors. For each element of vector A, I would like to know all the elements of vector B that fulfill a certain condition. For example, take two data frames containing the vectors:

person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
                 tickets = c(20, 24, 16, 17))
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand", 
                         "moodring", "figurine", "whistle", "saxophone"),
                cost = c(6, 11, 13, 17, 21, 23, 25, 30))

In this example, each person in the "person" data frame has a number of tickets from a carnival game, and each prize in the "prize" data frame has a cost. But I'm not looking for perfect matches: instead of simply buying a prize, each person randomly receives any prize whose cost is within a 5-ticket tolerance of the number of tickets they hold.

The output I'm looking for is a data frame of all the possible prizes each person could win. It would look something like this:

    person      prize
1   Albert stickyhand
2   Albert   moodring
3   Albert   figurine
4   Albert    whistle
5    Becca   moodring
6    Becca   figurine
       ...        ...

And so on. Right now, I'm doing this with lapply(), but this is really no faster than a for() loop in R.

library(dplyr)
matching_Function <- function(person, prize, tolerance = 5){
  matchlist <- lapply(split(person, list(person$name)),
                      function(x) filter(prize, abs(x$tickets-cost)<=tolerance)$type)
  longlist <- data.frame("person" = rep(names(matchlist), 
                                    times = unlist(lapply(matchlist, length))),
                         "prize" = unname(unlist(matchlist))
  )
  return(longlist)
}
matching_Function(person, prize)

My actual datasets are much larger (in the hundreds of thousands of rows), and my matching conditions are more complicated (checking coordinates from B to see whether they are within a set radius of coordinates from A), so this is taking forever (several hours).

Are there any smarter ways than for() and lapply() to solve this?

Recommended answer

An alternative using foverlaps() from data.table that does what you want:

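library(data.table)

# Tolerance (in tickets) around each person's ticket count;
# the original snippet used this variable without defining it
tolerance <- 5

# Convert both data frames to data.tables by reference
setDT(person)
setDT(prize)
# Add the lower and upper bounds of each person's acceptable cost range
person[, `:=`(start = tickets - tolerance, end = tickets + tolerance)]
# Add a dummy column so that each prize cost forms a zero-width interval [cost, dummy]
prize[, dummy := cost]
# Key the person table on start and end, which foverlaps() needs for its second table
setkey(person, start, end)
# Match each prize to the persons whose range contains its cost, drop the prizes
# that match nobody (NA name), and keep only the person name and the prize type
r <- foverlaps(prize, person, type = "within", by.x = c("cost", "dummy"))[
  !is.na(name), list(name = name, prize = type)]
# Reorder the result by name instead of by prize cost
setorder(r, name)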

Output:

       name      prize
 1:  Albert stickyhand
 2:  Albert   moodring
 3:  Albert   figurine
 4:  Albert    whistle
 5:   Becca   moodring
 6:   Becca   figurine
 7:   Becca    whistle
 8:  Celine   lollipop
 9:  Celine      yo-yo
10:  Celine stickyhand
11:  Celine   moodring
12: Dagwood      yo-yo
13: Dagwood stickyhand
14: Dagwood   moodring

I hope the code is commented well enough to be self-explanatory.
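
As a cross-check of the 14-row output above, here is a minimal base R sketch that does not use data.table. It is not part of the original answer: it builds the full person-by-prize comparison matrix with outer(), so it is only reasonable for small inputs like this toy example. The names `within_tol`, `idx` and `check` are illustrative.

```r
# Same toy data as in the question, rebuilt here so the block runs on its own.
person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
                     tickets = c(20, 24, 16, 17),
                     stringsAsFactors = FALSE)
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand",
                             "moodring", "figurine", "whistle", "saxophone"),
                    cost = c(6, 11, 13, 17, 21, 23, 25, 30),
                    stringsAsFactors = FALSE)
tolerance <- 5

# Logical matrix with one row per person and one column per prize:
# TRUE where the prize cost is within `tolerance` of the ticket count.
within_tol <- abs(outer(person$tickets, prize$cost, "-")) <= tolerance

# Row (person) and column (prize) indices of every TRUE cell.
idx <- which(within_tol, arr.ind = TRUE)
# Sort by person, then by prize, to mirror the ordering of the output above.
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

check <- data.frame(name = person$name[idx[, "row"]],
                    prize = prize$type[idx[, "col"]],
                    stringsAsFactors = FALSE)
print(check)
```

The matrix has nrow(person) * nrow(prize) cells, so with hundreds of thousands of rows on each side it would not fit in memory. That is the practical reason to prefer an interval join such as foverlaps() for the real data.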

For the second part of the question, using coordinates and testing within a radius:

person <- structure(list(name = c("Albert", "Becca", "Celine", "Dagwood"), 
                         x = c(26, 16, 32, 51), 
                         y = c(92, 51, 25, 4)), 
                    .Names = c("name", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
antenas <- structure(list(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"), 
                          x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17), 
                          y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47)), 
                     .Names = c("name", "x", "y"), row.names = c(NA, -12L), class = "data.frame")

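setDT(person)
setDT(antenas)
# Radius within which an antenna counts as a match
r <- 10

# For each person, compute the offsets to every antenna and keep the names
# of the antennas whose squared distance is at most r^2
results <- person[, {
  dx <- x - antenas$x
  dy <- y - antenas$y
  list(antena = antenas$name[dx^2 + dy^2 <= r^2])
}, by = name]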

data.table allows expressions in j, so for each person we can do the maths of the outer join against the antennas and return only the relevant rows, with the antenna name.

This should not consume too much memory, as it is done for each row of person rather than for the whole table at once.
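
To make the memory argument concrete, here is a base R sketch of the same per-person computation. It is not from the original answer. It still loops over persons, as grouping by name does, but the comparison against all antennas is vectorised inside each iteration, so only vectors of length nrow(antenas) exist at any one time. The names `per_person` and `results_base` are illustrative.

```r
# Same coordinates as above, rebuilt here so the block runs on its own.
person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
                     x = c(26, 16, 32, 51),
                     y = c(92, 51, 25, 4),
                     stringsAsFactors = FALSE)
antenas <- data.frame(name = c("A", "B", "C", "D", "E", "F",
                               "G", "H", "I", "J", "K", "L"),
                      x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17),
                      y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47),
                      stringsAsFactors = FALSE)
r <- 10

# One iteration per person; dx and dy are plain vectors with one entry
# per antenna, never a full person-by-antenna matrix.
per_person <- lapply(seq_len(nrow(person)), function(i) {
  dx <- person$x[i] - antenas$x
  dy <- person$y[i] - antenas$y
  # Compare squared distances to avoid computing square roots.
  antenas$name[dx^2 + dy^2 <= r^2]
})

# Flatten to a long table; persons with no antenna in range contribute no rows.
results_base <- data.frame(name = rep(person$name, lengths(per_person)),
                           antena = unlist(per_person, use.names = FALSE),
                           stringsAsFactors = FALSE)
print(results_base)
```

Peak memory grows with the number of antennas rather than with the product of the two table sizes, at the cost of one pass over the antennas per person.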

Maths inspired by this question

This gives:

> results
     name antena
1:  Becca      L
2: Celine      G
3: Celine      H
