R - How to vectorize with apply family functions and avoid while/for loops in this case?


Problem description

For background, more details can be found in this question: Count how many observations in the rest of the dat fits multiple conditions? (R)

This is a dataset called event, containing thousands of events (observations); I selected several rows to show the data structure. It contains the "STATEid", the "date" of occurrence, and geographical coordinates in two variables, "LON" and "LAT". I want to calculate a new variable (column) for each row. This new variable should be: given any specific event, search the rest of the dataset and count the number of events that happened in the same state, within a radius of 50/100 km, in the next 30/60 days.
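The radius condition is not part of the code in this question. As a rough sketch only, it could be expressed with a haversine (great-circle) distance in base R. The function name haversine_km and the 6371 km Earth radius are assumptions made for illustration; the coordinates are three of the sample rows from the dput output further down.

```r
# Great-circle distance in km from one point to a vector of points.
# Inputs are decimal degrees; 6371 km is an approximate mean Earth radius.
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Sample coordinates (ohio, michigan, illinois rows of the dput output)
lon <- c(-80.6495194, -83.6129939, -87.8328505)
lat <- c(41.0997803, 42.2411499, 42.4461322)

# Distance from the first event to every event, then the 50 km filter
d <- haversine_km(lon[1], lat[1], lon, lat)
within_50km <- d <= 50
print(data.frame(lon, lat, km = round(d, 1), within_50km))
```

The logical vector within_50km could then be combined with the date and state conditions using &, in the same way the other filters are combined.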

I have written a user-defined function that is called from a loop. To keep it simple, I only included two conditions: within 30 days and in the same state:

n = 1

f = function(i) {
  a = i[n,]
  b = a$date
  # c = a$LON
  # d = a$LAT
  e = a$STATEid
  f = a$RID
  g1 = sum(i$CASE  [i$date<= b+30 & i$date>b & i$STATEid==e], na.rm=T)
  # g2 = sum(i$viold [i$date<= b+30 & i$date>b], na.rm=T)
  # g3 = sum(i$CASE  [i$date<= b+60 & i$date>b], na.rm=T)
  # g4 = sum(i$viold [i$date<= b+60 & i$date>b], na.rm=T)
  # h = cbind(g1, g2, g3, g4)
  g1 = data.frame(g1)
  n = n+1
  assign(as.character(f), g1, envir = .GlobalEnv)
}

for (n in 1:20) (f(event2))

This takes too long because the dataset contains 23,000 cases. My PC with 16 GB of RAM cannot cope even when the loop only needs to run twice! So I think avoiding loops would be preferable. Could you please suggest a way to vectorize my code and avoid the looping?

My main problem is that I don't know how to write a user-defined function that refers to each row and each variable properly when multiple conditions are required. That's why, in my loop function, I created objects such as "a", "b", "c", "d" and "e" to call them properly. Inefficient, I know.

My dput output looks like this:

     > dput(tail(event2[,c("RID", "STATEid", "date", "LON", "LAT")]))
structure(list(RID = c("023610", "023611", "023613", "023614", 
"023615", "023616"), STATEid = structure(c(36L, 36L, 23L, 23L, 
5L, 14L), .Label = c("alabama", "alaska", "arizona", "arkansas", 
"california", "colorado", "connecticut", "delaware", "district of columbia", 
"florida", "georgia", "hawaii", "idaho", "illinois", "indiana", 
"iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", 
"massachusetts", "michigan", "minnesota", "mississippi", "missouri", 
"montana", "nebraska", "nevada", "new hampshire", "new jersey", 
"new mexico", "new york", "north carolina", "north dakota", "ohio", 
"oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", 
"south dakota", "tennessee", "texas", "utah", "vermont", "virginia", 
"washington", "west virginia", "wisconsin", "wyoming"), class = "factor"), 
    date = structure(c(3620, -633, 131, -315, 5421, 3558), class = "Date"), 
    LON = c(-80.6495194, -80.6495194, -83.6129939, -83.6129939, 
    -121.6169108, -87.8328505), LAT = c(41.0997803, 41.0997803, 
    42.2411499, 42.2411499, 39.1404477, 42.4461322)), .Names = c("RID", 
"STATEid", "date", "LON", "LAT"), row.names = c(23610L, 23611L, 
23613L, 23614L, 23615L, 23616L), class = "data.frame")
> 

Many thanks. I appreciate your help.

---------- Jan. 20, 2018 update ---------

I have created a loop that works and correctly reflects what I am hoping for:

g = event2[FALSE,]

USERFUN = function(i) {
  a = i[n,] # retrieve each row from the object, make it a data object
  b = a$date # get date
  # c = a$LON # for now I dropped the idea of calculating radius
  # d = a$LAT # for now I dropped the idea of calculating radius
  e = a$STATEid # get STATE
  f = a$RID # get case ID to name the objects generated!

  PostAct30 = sum(i$CASE [i$date<= b+30 & i$date>b & i$STATEid == e], na.rm=T) # multiple conditions defined here - i is the entire dataset 
  PostAct60 = sum(i$CASE [i$date<= b+60 & i$date>b & i$STATEid == e], na.rm=T) # multiple conditions defined here - b, e are dynamic, retrieving from each line!!!
  PreAct30 = sum(i$CASE [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  PreAct60 = sum(i$CASE [i$date<= b & i$date>b-60 & i$STATEid == e], na.rm=T) # 60-day look-back window
  PostVio30 = sum(i$viold [i$date<= b+30 & i$date>b & i$STATEid == e], na.rm=T)
  PostVio60 = sum(i$viold [i$date<= b+60 & i$date>b & i$STATEid == e], na.rm=T)
  PreVio30 = sum(i$viold [i$date<= b & i$date>b-30 & i$STATEid == e], na.rm=T)
  PreVio60 = sum(i$viold [i$date<= b & i$date>b-60 & i$STATEid == e], na.rm=T) # 60-day look-back window
  g1 = data.frame(f, PostAct30, PostAct60, PreAct30, PreAct60, PostVio30, PostVio60, PreVio30, PreVio60)
  n = n+1
  return(g1)
  }
# sum(event2$ca)
n = 1
for (n in 1:19446) {
  g2 = USERFUN(event2)
  g = rbind(g, g2)        
}

The output looks like this:

> tail(event3 [c("date","STATEid", "PostAct30", "PostAct60", "PostVio30", "PostVio60")])
            date    STATEid PostAct30 PostAct60 PostVio30 PostVio60
23611 1968-04-08       ohio         3         4         0         0
23612       <NA>    arizona        NA        NA        NA        NA
23613 1970-05-12   michigan         4         6         2         4
23614 1969-02-20   michigan         2         3         1         1
23615 1984-11-04 california         4         5         0         0
23616 1979-09-29   illinois         0         2         0         1

Recommended answer

Consider mapply to add new columns in place by passing date and STATEid elementwise into a defined function. Specifically, mapply returns a matrix with 8 rows (one per statistic) and one column per event; transpose it with t() and assign the result to event2.

# Returns the 8 window sums for one event with date b and state e
dates_calc_fct <- function(b, e) 
  c(sum(event2$CASE [event2$date<= b+30 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b+60 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T),
    sum(event2$CASE [event2$date<= b & event2$date>b-60 & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b+30 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b+60 & event2$date>b & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b & event2$date>b-30 & event2$STATEid == e], na.rm=T),
    sum(event2$viold [event2$date<= b & event2$date>b-60 & event2$STATEid == e], na.rm=T)
   )

# t() turns the 8 x nrow(event2) result into one row per event
event2[c("PostAct30", "PostAct60", 
         "PreAct30", "PreAct60",
         "PostVio30", "PostVio60", 
         "PreVio30", "PreVio60")] <- t(mapply(dates_calc_fct, event2$date, event2$STATEid))
