Vectorizing a loop through rows of a data frame in R while accessing multiple variables of the data frame
Problem description
Yet another apply question.
I've reviewed a lot of documentation on the apply family of functions in R (and use them quite a bit in my work). I've defined a function myfun below, which I want to apply to every row of the data frame inc (created as incdf in the sample data below). I think I need some variant of apply(inc, 1, myfun). I've played around with it for a while but still can't quite get it. I've included a loop that achieves exactly what I want; it is just very slow and inefficient on my real data, which is considerably larger than the sample data included here.
I expect it's a quick fix, but I can't quite put my finger on it. Maybe something with the special argument ... to apply?
In plain English, here is what the code below does: I want to look at every Submit.Date in the inc data frame and, for each of these dates, count how many rows of chg have the same ID and a chg$Submit.Date within some range of the inc$Submit.Date. The range is controlled by fdays and bdays in myfun.
chgdf <- data.frame(
  Submit.Date = as.Date(c("2013-09-27", "2013-09-04", "2013-08-01", "2013-06-24", "2013-05-29", "2013-08-20")),
  ID = c("001", "001", "001", "001", "001", "005"),
  stringsAsFactors = FALSE
)
incdf <- data.frame(
  Submit.Date = as.Date(c("2013-10-19", "2013-09-14", "2013-08-22", "2013-08-20")),
  ID = c("001", "001", "002", "006"),
  stringsAsFactors = FALSE
)
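# Function I want to apply to each row of the data frame
myfun <- function(tdate, aid, chg=chgdf, inc=incdf, fdays=30, bdays=30) {
  # Turn the day offsets into the (exclusive) bounds of the date window
  fdays <- tdate + fdays
  bdays <- tdate - bdays
  # Keep the chg rows with a matching ID and a date strictly inside the window
  chg2 <- chg[chg$ID == aid & chg$Submit.Date < fdays & chg$Submit.Date > bdays, ]
  ret <- nrow(chg2)
  return(ret)
}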
# Works for a single row of the inc data frame
aid <- '001'
tdate <- incdf[incdf$ID == aid, 'Submit.Date'][1]
myfun(tdate, aid=aid, bdays=50, fdays=100)
# Works, but is slow with the full data set
incdf$chgw <- 0
for (i in seq_len(nrow(incdf))) {
  aid <- incdf$ID[i]
  tdate <- incdf$Submit.Date[i]
  incdf$chgw[i] <- myfun(tdate, aid, bdays=50, fdays=100)
}
Recommended answer
First, apply converts the data frame to a matrix before iterating over it, and because the columns have mixed types, all values are coerced to strings. You therefore need to convert tdate back to a Date before using it; otherwise you are trying to add days to a string:
tdate <- as.Date(tdate)
fdays <- tdate+fdays
bdays <- tdate-bdays
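To see the coercion directly, the following minimal sketch reuses the question's sample incdf data and inspects what apply actually hands to the function. The names m, row_classes and first_row are illustrative only and do not appear in the original post.

```r
# Sample data copied from the question
incdf <- data.frame(
  Submit.Date = as.Date(c("2013-10-19", "2013-09-14", "2013-08-22", "2013-08-20")),
  ID = c("001", "001", "002", "006"),
  stringsAsFactors = FALSE
)

# apply() converts a data frame with as.matrix(); a Date column next to a
# character column produces a character matrix.
m <- as.matrix(incdf)
is.character(m)  # TRUE

# Each row therefore arrives as a character vector, not as a Date plus an ID.
row_classes <- apply(incdf, 1, function(row) class(row))

# Converting the first element back to Date restores date arithmetic.
first_row <- m[1, ]
as.Date(first_row[["Submit.Date"]]) + 30
```

This is why both solutions below wrap the first element of the row in as.Date before doing any arithmetic on it.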
Second, you call apply(inc, 1, myfun). Note that in this case apply passes a single argument to myfun (the whole row), not the several separate arguments that myfun expects.
Solution 1: Change your function to receive a whole row of the data frame and call apply as you did:
myfun <- function(row, chg=chgdf, inc=incdf, fdays=30, bdays=30) {
  # row is a character vector: row[1] is the date, row[2] is the ID
  tdate <- as.Date(row[1])
  fdays <- tdate + fdays
  bdays <- tdate - bdays
  chg2 <- chg[chg$ID == row[2] & chg$Submit.Date < fdays & chg$Submit.Date > bdays, ]
  ret <- nrow(chg2)
  return(ret)
}
> apply(incdf, 1, myfun)
[1] 1 2 0 0
Solution 2: Keep the original signature of myfun and call it from apply through an anonymous function that passes every argument explicitly:
myfun <- function(tdate, aid, chg=chgdf, inc=incdf, fdays=30, bdays=30) {
  fdays <- tdate + fdays
  bdays <- tdate - bdays
  chg2 <- chg[chg$ID == aid & chg$Submit.Date < fdays & chg$Submit.Date > bdays, ]
  ret <- nrow(chg2)
  return(ret)
}
> apply(incdf, 1, function(row) myfun(as.Date(row[1]), row[2]))
[1] 1 2 0 0
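As an alternative that is not part of the original answer: base R's mapply walks several vectors in parallel, so the two columns can be passed as separate arguments without first going through a character matrix. The sketch below assumes the sample data and the two-argument myfun from the question; the result names counts and counts_wide are illustrative only.

```r
# Sample data and function copied from the question
chgdf <- data.frame(
  Submit.Date = as.Date(c("2013-09-27", "2013-09-04", "2013-08-01",
                          "2013-06-24", "2013-05-29", "2013-08-20")),
  ID = c("001", "001", "001", "001", "001", "005"),
  stringsAsFactors = FALSE
)
incdf <- data.frame(
  Submit.Date = as.Date(c("2013-10-19", "2013-09-14", "2013-08-22", "2013-08-20")),
  ID = c("001", "001", "002", "006"),
  stringsAsFactors = FALSE
)
myfun <- function(tdate, aid, chg = chgdf, inc = incdf, fdays = 30, bdays = 30) {
  fdays <- tdate + fdays
  bdays <- tdate - bdays
  nrow(chg[chg$ID == aid & chg$Submit.Date < fdays & chg$Submit.Date > bdays, ])
}

# One call to myfun per row; the date and the ID arrive as separate arguments.
counts <- mapply(myfun, incdf$Submit.Date, incdf$ID)

# Arguments that stay fixed across rows go into MoreArgs.
counts_wide <- mapply(myfun, incdf$Submit.Date, incdf$ID,
                      MoreArgs = list(bdays = 50, fdays = 50))
```

This still calls myfun once per row, so it mainly removes the as.Date round trip rather than the per-row cost.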
I personally prefer the second solution, because it lets you override the default values of the other parameters of myfun:
> apply(incdf, 1, function(row) myfun(as.Date(row[1]), row[2], bdays=50, fdays=50))
[1] 2 3 0 0
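Both solutions still evaluate myfun once per row, so they may not be much faster than the loop on large data. One possible fully vectorized approach, offered here as a sketch rather than as part of the original answer, is to join the two data frames on ID with merge and count the matching pairs in a single pass. The function name count_in_window and the helper column row_id are hypothetical names introduced for this illustration.

```r
# Sample data copied from the question
chgdf <- data.frame(
  Submit.Date = as.Date(c("2013-09-27", "2013-09-04", "2013-08-01",
                          "2013-06-24", "2013-05-29", "2013-08-20")),
  ID = c("001", "001", "001", "001", "001", "005"),
  stringsAsFactors = FALSE
)
incdf <- data.frame(
  Submit.Date = as.Date(c("2013-10-19", "2013-09-14", "2013-08-22", "2013-08-20")),
  ID = c("001", "001", "002", "006"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for every row of inc, count the rows of chg with the
# same ID whose date lies strictly inside (date - bdays, date + fdays).
count_in_window <- function(inc, chg, fdays = 30, bdays = 30) {
  inc <- inc[, c("Submit.Date", "ID")]
  chg <- chg[, c("Submit.Date", "ID")]
  inc$row_id <- seq_len(nrow(inc))  # remember the original row order
  # Inner join on ID: one output row per (inc row, chg row) pair sharing an ID
  pairs <- merge(inc, chg, by = "ID", suffixes = c(".inc", ".chg"))
  hit <- pairs$Submit.Date.chg < pairs$Submit.Date.inc + fdays &
         pairs$Submit.Date.chg > pairs$Submit.Date.inc - bdays
  # Count hits per original row; rows without any match get 0
  tabulate(pairs$row_id[hit], nbins = nrow(inc))
}

res_default <- count_in_window(incdf, chgdf)
res_wide <- count_in_window(incdf, chgdf, fdays = 50, bdays = 50)
```

The trade-off is memory: the join materializes every pair of rows that share an ID, so this is likely to help only when no single ID has a very large number of rows in both data frames.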