R - Need help to split by column keyword or group by keyword then loop with if else calculation and merge back or ungroup

Problem description


I currently have a data frame, summary2, and a for loop that calculates a median based on a number of rows before the current row. But I need to run this loop separately for each unique keyword (drink, food, etc.). Otherwise, the first food rows will use numbers from drink in their calculation. I tried to use split and group_by, but without success.

summary2 data frame:
keywords - hits - date
drink - 4 - 01-01-2016
drink - 5 - 01-02-2016
drink - 8 - 01-03-2016
drink - 4 - 01-04-2016
drink - 5 - 01-05-2016
drink - 8 - 01-06-2016
drink - 4 - 01-07-2016
drink - 5 - 01-08-2016
drink - 8 - 01-09-2016
drink - 4 - 01-10-2016
drink - 5 - 01-11-2016
drink - 8 - 01-12-2016
food - 4 - 01-01-2016
food - 5 - 01-02-2016
food - 8 - 01-03-2016
food - 4 - 01-04-2016
food - 5 - 01-05-2016
food - 8 - 01-06-2016
food - 4 - 01-07-2016
food - 5 - 01-08-2016
food - 8 - 01-09-2016
food - 4 - 01-10-2016
food - 5 - 01-11-2016
food - 8 - 01-12-2016

Loop code:

for (i in 1:nrow(summary2)) {
  if (i < "3") {
    summary2$median[i] = median(summary2$hits[i:(i+3)])
  }
  else if (i == "3") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-2)])
  }
  else if (i == "4") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-3)])
  }
  else if (i == "5") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-4)])
  }
  else if (i == "6") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-4)])
  }
  else if (i == "7") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-5)])
  }
  else if (i == "8") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-6)])
  }
  else if (i == "9") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-7)])
  }
  else if (i == "10") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-7)])
  }
  
  else {
    summary2$median[i] = median(summary2$hits[(i-1):(i-8)])
    
  }

}

Solution

Some coding "nudges" I suggest:

  1. Your for loop iterates over integers, so don't compare i with a string. R will run the comparison without complaint, but not the way you expect. First, when one side is a string, R coerces the other side to a string too, so your inequality test is a lexicographic comparison, not a numeric one: while 20 < 3 is false, "20" < "3" is true. In your loop that means i < "3" is also true for rows 10 through 24, which then take the first branch. Second, in general I think it's just best practice to be explicit about the type/class of variables you're expecting to use, so the first conditional would be just if (i < 3).

  2. This is a moving-window median, which is fine. But it is inconsistent. This might be by design, but as an analyst I find it difficult to explain that a number is the median of the last so-many values unless it's at the beginning, in which case it is the median of the following values, which is somewhat anathema to some time-series practices. Options for rolling calcs on the first few values (fewer than the window size) include: omitting them (not friendly for a data.frame); replacing them with NA until you have enough values for a full window; or allowing partial windows (where if you're expecting a window size of 5, the first value would be a median of itself, the second a median of the first two, etc.).
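The comparison issue from item 1 can be checked with a few lines of base R. This is a minimal sketch; the variable names are illustrative, and 1:24 stands in for the row indices of a 24-row data frame like summary2.

```r
# Numeric vs. string comparison in R
num_cmp <- 20 < 3        # numeric comparison
str_cmp <- "20" < "3"    # lexicographic: "2" sorts before "3"
mixed   <- 20 < "3"      # 20 is coerced to "20", so this is lexicographic too

# Row indices (out of 24) for which the loop's first test, i < "3", is TRUE
first_branch <- which((1:24) < "3")

print(c(num_cmp = num_cmp, str_cmp = str_cmp, mixed = mixed))
# num_cmp str_cmp   mixed
#   FALSE    TRUE    TRUE
print(first_branch)
#  [1]  1  2 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
```

With a numeric test, (1:24) < 3, only rows 1 and 2 would match.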

I'll fold those thoughts into my code and end up with the result below. The standard tool for rolling-window calcs in R has for a long time been the zoo package and its rollapply family of functions. (A recent newcomer to the field is the slider package; I don't have experience with it yet, but it offers many features that zoo does not.)

First, I'll demo this on just the "drink" data, in rows 1:12. The 9 is the window size: since you want from i-1 to i-8 in general, you need 9, which is the previous 8 plus the current value. Since you don't want to include the current value in the median, we'll exclude it in the calcs.

zoo::rollapply(summary2$hits[1:12], 9, function(z) median(z[-length(z)], na.rm = TRUE), align = "right")
# [1] 5 5 5 5

There were 12 values but we only got 4 back. That's because it needed to go a bit into the data before it had enough for a full window of 9. One of the remedies I suggested is a partial window:

zoo::rollapply(summary2$hits[1:12], 9, function(z) median(z[-length(z)], na.rm = TRUE), align = "right", partial = TRUE)
#  [1]  NA 4.0 4.5 5.0 4.5 5.0 5.0 5.0 5.0 5.0 5.0 5.0

which is what we'll use. The NA is not unexpected: since our general rule is to take everything before the current value, the first one has nothing before it, so there are no values to take the median of.
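To make the partial-window rule concrete, here is a rough base-R equivalent that does not need zoo. It is only a sketch: prev_median is a hypothetical helper name, and k = 8 is the number of previous values (the window of 9 minus the current value).

```r
# Median of up to `k` values before position i, excluding the current value
prev_median <- function(x, k = 8) {
  vapply(seq_along(x), function(i) {
    prev <- tail(x[seq_len(i - 1)], k)   # empty when i = 1
    if (length(prev) == 0) NA_real_ else median(prev)
  }, numeric(1))
}

hits <- c(4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8)  # the "drink" rows
res <- prev_median(hits)
print(res)
#  [1]  NA 4.0 4.5 5.0 4.5 5.0 5.0 5.0 5.0 5.0 5.0 5.0
```

On this data it reproduces the partial-window output above, including the leading NA.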

Since you want the first few to look "forward" in time a little (I mentioned it in the nudge #2 above), we'll write a function that does this and then compensates for the first few values.

func <- function(x, k = 9) {
  out <- zoo::rollapply(x, k, function(z) median(z[-length(z)], na.rm = TRUE), align = "right", partial = TRUE)
  out[ seq_len(min(2, length(x))) ] <- median(head(x, 4), na.rm = TRUE)
  out
}

Now our return for "drink" looks like:

func(summary2$hits[1:12])
#  [1] 4.5 4.5 4.5 5.0 4.5 5.0 5.0 5.0 5.0 5.0 5.0 5.0

Now, to do this by "keywords", we can use ave:

summary2$rollmedian <- ave(summary2$hits, summary2$keywords, FUN = func)
summary2
#    keywords hits        date rollmedian
# 1     drink    4  01-01-2016        4.5
# 2     drink    5  01-02-2016        4.5
# 3     drink    8  01-03-2016        4.5
# 4     drink    4  01-04-2016        5.0
# 5     drink    5  01-05-2016        4.5
# 6     drink    8  01-06-2016        5.0
# 7     drink    4  01-07-2016        5.0
# 8     drink    5  01-08-2016        5.0
# 9     drink    8  01-09-2016        5.0
# 10    drink    4  01-10-2016        5.0
# 11    drink    5  01-11-2016        5.0
# 12    drink    8  01-12-2016        5.0
# 13    food     4  01-01-2016        4.5
# 14    food     5  01-02-2016        4.5
# 15    food     8  01-03-2016        4.5
# 16    food     4  01-04-2016        5.0
# 17    food     5  01-05-2016        4.5
# 18    food     8  01-06-2016        5.0
# 19    food     4  01-07-2016        5.0
# 20    food     5  01-08-2016        5.0
# 21    food     8  01-09-2016        5.0
# 22    food     4  01-10-2016        5.0
# 23    food     5  01-11-2016        5.0
# 24    food     8  01-12-2016        5.0
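Since the question mentions trying split, here is a hedged base-R sketch suggesting that ave is roughly shorthand for split, apply per group, then unsplit. The toy data frame df and the stand-in function run_med are hypothetical, chosen so the example runs without zoo; the groups are interleaved on purpose to show that the original row order is preserved.

```r
# Toy data with interleaved groups
df <- data.frame(
  keywords = c("drink", "food", "drink", "food", "drink", "food"),
  hits     = c(4, 10, 5, 20, 8, 30)
)

# Stand-in per-group function: running median of the values seen so far
run_med <- function(x) {
  vapply(seq_along(x), function(i) median(x[seq_len(i)]), numeric(1))
}

# One call with ave ...
via_ave <- ave(df$hits, df$keywords, FUN = run_med)

# ... versus the explicit split / apply / unsplit steps
pieces    <- split(df$hits, df$keywords)
via_split <- unsplit(lapply(pieces, run_med), df$keywords)

print(via_ave)
# [1]  4.0 10.0  4.5 15.0  5.0 20.0
print(all.equal(via_ave, via_split))
```

Either form keeps each keyword's calculation from seeing another keyword's rows, which is the problem the original loop had.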


Data

summary2 <- structure(list(keywords = c("drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food "), hits = c(4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8), date = c(" 01-01-2016", " 01-02-2016", " 01-03-2016", " 01-04-2016", " 01-05-2016", " 01-06-2016", " 01-07-2016", " 01-08-2016", " 01-09-2016", " 01-10-2016", " 01-11-2016", " 01-12-2016", "01-01-2016", "01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016", "01-06-2016", "01-07-2016", "01-08-2016", "01-09-2016", "01-10-2016", "01-11-2016", "01-12-2016")), class = "data.frame", row.names = c(NA, -24L))
