R: how to vectorize a function with multiple if/else conditions
Question
Hi, I am new to vectorizing functions in R. I have code similar to the following.
library(truncnorm)
library(microbenchmark)
num_obs=10000
Observation=seq(1,num_obs)
Obs_Type=sample(1:4, num_obs, replace=T)
Upper_bound = runif(num_obs,0,1)
Lower_bound=runif(num_obs,2,4)
mean = runif(num_obs,10,15)
df1= data.frame(Observation,Obs_Type,Upper_bound,Lower_bound,mean)
df1$draw_value = 0
Trial_func=function(df1){
  for (i in 1:nrow(df1)){
    if (df1[i,"Obs_Type"] ==1){
      #If Type == 1; then a=-Inf, b = Upper_Bound
      df1[i,"draw_value"] = rtruncnorm(1,a=-Inf,b=df1[i,"Upper_bound"],mean= df1[i,"mean"],sd=1)
    } else if (df1[i,"Obs_Type"] ==2){
      #If Type == 2; then a=-10, b = Upper_Bound
      df1[i,"draw_value"] = rtruncnorm(1,a=-10,b=df1[i,"Upper_bound"],mean= df1[i,"mean"],sd=1)
    } else if(df1[i,"Obs_Type"] ==3){
      #If Type == 3; then a=Lower_bound, b = Inf
      df1[i,"draw_value"] = rtruncnorm(1,a=df1[i,"Lower_bound"],b=Inf,mean= df1[i,"mean"],sd=1)
    } else {
      #If Type == 4; then a=Lower_bound, b = 10
      df1[i,"draw_value"] = rtruncnorm(1,a=df1[i,"Lower_bound"],b=10,mean= df1[i,"mean"],sd=1)
    }
  }
  return(df1)
}
#Benchmarking
mbm=microbenchmark(Trial_func(df1=df1),times = 10)
summary(mbm)
#For obtaining the new data
New_data=Trial_func(df1=df1)
In the code above, I first create a data frame called df1. I then define a function that takes this dataset. Each observation in df1 can be one of four types, given by df1$Obs_Type. Based on Obs_Type, I want to draw a value from a truncated normal distribution with the given lower and upper truncation points.
The rules are:
a) When Obs_Type = 1: a = -Inf, b = Upper_bound value of observation i.
b) When Obs_Type = 2: a = -10, b = Upper_bound value of observation i.
c) When Obs_Type = 3: a = Lower_bound value of observation i, b = Inf.
d) When Obs_Type = 4: a = Lower_bound value of observation i, b = 10.
Here a is the lower truncation point and b is the upper truncation point. In addition, the mean for observation i is given by df1$mean, and sd = 1.
I am not familiar with vectorizing and was wondering if someone could help me with this. I tried looking at some other examples on SO (for example, this one) but could not figure out what to do when I have multiple conditions.
My original dataset has about 10 million observations and additional conditions (for example, instead of 4 types my data has 16 types, and the mean changes with each type), but I used a simpler example here.
Please let me know if any part of the question requires additional clarification.
Recommended answer
Here is a vectorized way. It creates logical vectors i1, i2, i3 and i4 corresponding to the 4 conditions. Then it assigns the new values to the positions indexed by them.
Trial_func2 <- function(df1){
  # One logical index vector per observation type
  i1 <- df1[["Obs_Type"]] == 1
  i2 <- df1[["Obs_Type"]] == 2
  i3 <- df1[["Obs_Type"]] == 3
  i4 <- df1[["Obs_Type"]] == 4
  #If Type == 1; then a=-Inf, b = Upper_Bound
  df1[i1, "draw_value"] <- rtruncnorm(sum(i1), a = -Inf,
                                      b = df1[i1, "Upper_bound"],
                                      mean = df1[i1, "mean"], sd = 1)
  #If Type == 2; then a=-10, b = Upper_Bound
  df1[i2, "draw_value"] <- rtruncnorm(sum(i2), a = -10,
                                      b = df1[i2, "Upper_bound"],
                                      mean = df1[i2, "mean"], sd = 1)
  #If Type == 3; then a=Lower_bound, b = Inf
  df1[i3, "draw_value"] <- rtruncnorm(sum(i3),
                                      a = df1[i3, "Lower_bound"],
                                      b = Inf, mean = df1[i3, "mean"],
                                      sd = 1)
  #If Type == 4; then a=Lower_bound, b = 10
  df1[i4, "draw_value"] <- rtruncnorm(sum(i4),
                                      a = df1[i4, "Lower_bound"],
                                      b = 10,
                                      mean = df1[i4, "mean"],
                                      sd = 1)
  df1
}
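With 16 types instead of 4, writing one block per type gets repetitive. A possible alternative, offered only as a sketch, is to keep the per-type rules in a small lookup table, build the full a and b vectors once, and make a single rtruncnorm call. The names type_rules, row_bounds and Trial_func_lookup are hypothetical and not part of the original answer, and the sketch assumes rtruncnorm accepts vectors for a, b and mean, as its documentation describes.

```r
# Hypothetical rule table: one row per Obs_Type.
# NA means "use the observation's own Lower_bound / Upper_bound column".
type_rules <- data.frame(
  Obs_Type = 1:4,
  a_fixed  = c(-Inf, -10, NA, NA),
  b_fixed  = c(NA, NA, Inf, 10)
)

# Build per-row truncation points without looping over rows.
row_bounds <- function(df, rules = type_rules) {
  idx <- match(df$Obs_Type, rules$Obs_Type)
  if (anyNA(idx)) stop("Obs_Type value with no matching rule")
  a <- rules$a_fixed[idx]
  b <- rules$b_fixed[idx]
  use_lower <- is.na(a)   # rows whose lower point comes from the data
  use_upper <- is.na(b)   # rows whose upper point comes from the data
  a[use_lower] <- df$Lower_bound[use_lower]
  b[use_upper] <- df$Upper_bound[use_upper]
  list(a = a, b = b)
}

# Assumption: the truncnorm package is installed; one draw per row.
Trial_func_lookup <- function(df) {
  bounds <- row_bounds(df)
  df$draw_value <- truncnorm::rtruncnorm(nrow(df), a = bounds$a, b = bounds$b,
                                         mean = df$mean, sd = 1)
  df
}
```

Adding a type then means adding a row to type_rules rather than another block of code. A per-type mean could be handled the same way with an extra column, though that part is not shown here.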
In the speed test, I have named @Dave2e's answer Trial_func3 (not shown here).
mbm <- microbenchmark(
loop = Trial_func(df1 = df1),
vect = Trial_func2(df1 = df1),
cwhen = Trial_func3(df1 = df1),
times = 10)
print(mbm, order = "median")
#Unit: milliseconds
#  expr         min          lq       mean      median          uq         max neval cld
#  vect    4.349444    4.371169    4.40920    4.401384    4.450024    4.487453    10  a
# cwhen   13.458946   13.484247   14.16045   13.528792   13.787951   19.363104    10  a
#  loop 2125.665690 2138.792497 2211.20887 2157.185408 2201.391083 2453.658767    10   b
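The loop draws row by row and the vectorized version draws type by type, so the two functions will not produce identical values even with the same seed. A rough sanity check, sketched below, is to confirm that every draw falls inside the truncation interval implied by its type. The helper draws_within_bounds is a hypothetical name, and it assumes the four rules from the question.

```r
# TRUE when every draw_value lies inside the interval for its Obs_Type.
draws_within_bounds <- function(df) {
  type <- df$Obs_Type
  # Types 1 and 2 have fixed lower points; types 3 and 4 use Lower_bound.
  lower <- ifelse(type == 1, -Inf, ifelse(type == 2, -10, df$Lower_bound))
  # Types 3 and 4 have fixed upper points; types 1 and 2 use Upper_bound.
  upper <- ifelse(type == 3, Inf, ifelse(type == 4, 10, df$Upper_bound))
  all(df$draw_value >= lower & df$draw_value <= upper)
}
```

Calling draws_within_bounds(Trial_func2(df1)) should then return TRUE if the vectorized function applies the rules as intended.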