R - Need help to split by column keyword or group by keyword then loop with if else calculation and merge back or ungroup
I currently have a data frame, summary2, and a for loop that calculates a median from the rows before the current row. I need this loop to run separately for each unique keyword (drink, food, etc.). Otherwise, the first food rows use numbers from drink in their calculation. I tried split and group_by, without success.
summary2 data frame:
keywords - hits - date
drink - 4 - 01-01-2016
drink - 5 - 01-02-2016
drink - 8 - 01-03-2016
drink - 4 - 01-04-2016
drink - 5 - 01-05-2016
drink - 8 - 01-06-2016
drink - 4 - 01-07-2016
drink - 5 - 01-08-2016
drink - 8 - 01-09-2016
drink - 4 - 01-10-2016
drink - 5 - 01-11-2016
drink - 8 - 01-12-2016
food - 4 - 01-01-2016
food - 5 - 01-02-2016
food - 8 - 01-03-2016
food - 4 - 01-04-2016
food - 5 - 01-05-2016
food - 8 - 01-06-2016
food - 4 - 01-07-2016
food - 5 - 01-08-2016
food - 8 - 01-09-2016
food - 4 - 01-10-2016
food - 5 - 01-11-2016
food - 8 - 01-12-2016
Loop code:
for (i in 1:nrow(summary2)) {
  if (i < "3") {
    summary2$median[i] = median(summary2$hits[i:(i+3)])
  } else if (i == "3") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-2)])
  } else if (i == "4") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-3)])
  } else if (i == "5") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-4)])
  } else if (i == "6") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-4)])
  } else if (i == "7") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-5)])
  } else if (i == "8") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-6)])
  } else if (i == "9") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-7)])
  } else if (i == "10") {
    summary2$median[i] = median(summary2$hits[(i-1):(i-7)])
  } else {
    summary2$median[i] = median(summary2$hits[(i-1):(i-8)])
  }
}
Some coding "nudges" I suggest:

1. You have a for loop that iterates over integers; don't compare i with a string. While R will end up doing what you think you need here, there are a few things you should know. First, your inequality test is a lexicographic comparison, not a numeric one, so while 20 < 3 is false, "20" < "3" is true. Second, I think it's best practice to be explicit about the type/class of the variables you expect to use, so the first conditional would be just if (i < 3).

2. This is a moving-window median, which is fine. But it is inconsistent; this might be by design, but as an analyst I find it difficult to explain that a number is a median of the last so-many values unless it's at the beginning, in which case it is the median of the following values, which is somewhat anathema to some time-series practices. Options for rolling calcs for the first few values (fewer than the window size) include: omitting them (not friendly for a data.frame); replacing them with NA until you have enough values to fill the window; or allowing partial windows (where, if you're expecting a window size of 5, the first value would be a median of itself, the second a median of the first two, etc.).
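To make the first nudge concrete, here is a small base-R sketch of how the same test behaves with numbers and with strings. The variable names are only for illustration.

```r
# Numeric comparison: 20 is not less than 3.
numeric_cmp <- 20 < 3        # FALSE

# Character comparison works character by character: "2" sorts before "3".
string_cmp <- "20" < "3"     # TRUE

# Mixed comparison: R coerces the number to a string first,
# so this is really "20" < "3".
i <- 20L
mixed_cmp <- i < "3"         # TRUE

# Comparing against a number keeps the test numeric.
fixed_cmp <- i < 3           # FALSE

print(c(numeric_cmp, string_cmp, mixed_cmp, fixed_cmp))
```

In the original loop the string comparison happens to give the intended branch for small i, but it would silently misfire for values such as 20.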
I'll fold those thoughts into my code and end up with the version below. First, the standard tool for rolling-window calcs in R has long been the zoo::rollapply family of functions. (A recent newcomer to the field is the slider package; I don't have experience with it yet, but it offers many features that zoo does not.)
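As a rough sketch of the three options for the leading values (omit, pad with NA, partial windows), here is how they might look with zoo::rollapply on a short made-up vector and a window of 3. The vector x and the result names are assumptions for illustration only.

```r
library(zoo)

x <- c(4, 5, 8, 4, 5, 8)

# Option 1: drop incomplete windows; the result is shorter than x (length 4).
omitted <- rollapply(x, 3, median, align = "right")

# Option 2: pad the incomplete windows with NA; same length as x.
padded <- rollapply(x, 3, median, align = "right", fill = NA)

# Option 3: allow partial windows at the start; same length as x.
partial <- rollapply(x, 3, median, align = "right", partial = TRUE)

print(list(omitted = omitted, padded = padded, partial = partial))
```

Options 2 and 3 keep the result the same length as the input, which is what makes them convenient for assigning back into a data.frame column.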
First, I'll demo this on just the "drink" data, in rows 1:12. The 9 is the window size: since in general you want rows i-1 through i-8, you need 9, which is the previous 8 plus the current value. Since you don't want to include the current value in the median, we'll exclude it in the calcs.
zoo::rollapply(summary2$hits[1:12], 9, function(z) median(z[-length(z)], na.rm = TRUE), align = "right")
# [1] 5 5 5 5
There were 12 values but we only got 4 back: rollapply has to get 9 values into the data before it can fill a full window of 9. One of the remedies I suggested is a partial window:
zoo::rollapply(summary2$hits[1:12], 9, function(z) median(z[-length(z)], na.rm = TRUE), align = "right", partial = TRUE)
# [1] NA 4.0 4.5 5.0 4.5 5.0 5.0 5.0 5.0 5.0 5.0 5.0
which is what we'll use. The NA is not unexpected: since our general rule is to take everything before the current value, the first one has nothing before it, so there are no values to take a median of.
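To see exactly which values each partial window covers, the same calculation can be sketched in base R without zoo. The function name prev_median and its n_prev argument are hypothetical, chosen only for this illustration.

```r
# Median of up to `n_prev` values before position i, excluding x[i] itself.
# `prev_median` and `n_prev` are illustrative names, not part of zoo.
prev_median <- function(x, n_prev = 8) {
  vapply(seq_along(x), function(i) {
    if (i == 1) return(NA_real_)     # nothing precedes the first value
    start <- max(1, i - n_prev)      # shorter window near the start
    as.numeric(median(x[start:(i - 1)]))
  }, numeric(1))
}

hits <- c(4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8)
print(prev_median(hits))
# [1]  NA 4.0 4.5 5.0 4.5 5.0 5.0 5.0 5.0 5.0 5.0 5.0
```

The second value is the median of the first value alone, the third is the median of the first two, and so on until eight preceding values are available.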
Since you want the first few values to look "forward" in time a little (I mentioned it in nudge #2 above), we'll write a function that computes the rolling median and then overrides the first few values.
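func <- function(x, k = 9) {
  # median of the values before the current one, allowing partial windows
  out <- zoo::rollapply(x, k, function(z) median(z[-length(z)], na.rm = TRUE), align = "right", partial = TRUE)
  # the first two positions look forward: median of the first four values
  out[ seq_len(min(2, length(x))) ] <- median(head(x, 4), na.rm = TRUE)
  out
}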
Now our return for "drink" looks like:
func(summary2$hits[1:12])
# [1] 4.5 4.5 4.5 5.0 4.5 5.0 5.0 5.0 5.0 5.0 5.0 5.0
Now, to do this by "keywords", we can use ave:
summary2$rollmedian <- ave(summary2$hits, summary2$keywords, FUN = func)
summary2
# keywords hits date rollmedian
# 1 drink 4 01-01-2016 4.5
# 2 drink 5 01-02-2016 4.5
# 3 drink 8 01-03-2016 4.5
# 4 drink 4 01-04-2016 5.0
# 5 drink 5 01-05-2016 4.5
# 6 drink 8 01-06-2016 5.0
# 7 drink 4 01-07-2016 5.0
# 8 drink 5 01-08-2016 5.0
# 9 drink 8 01-09-2016 5.0
# 10 drink 4 01-10-2016 5.0
# 11 drink 5 01-11-2016 5.0
# 12 drink 8 01-12-2016 5.0
# 13 food 4 01-01-2016 4.5
# 14 food 5 01-02-2016 4.5
# 15 food 8 01-03-2016 4.5
# 16 food 4 01-04-2016 5.0
# 17 food 5 01-05-2016 4.5
# 18 food 8 01-06-2016 5.0
# 19 food 4 01-07-2016 5.0
# 20 food 5 01-08-2016 5.0
# 21 food 8 01-09-2016 5.0
# 22 food 4 01-10-2016 5.0
# 23 food 5 01-11-2016 5.0
# 24 food 8 01-12-2016 5.0
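Since the question mentions split, here is a sketch of the same per-keyword calculation written as an explicit split, apply, and recombine. To keep it runnable without zoo it uses a base-R stand-in for func; the name roll_prev_median, the n_prev argument, and the small df are assumptions for illustration only.

```r
# Base-R stand-in for func(): median of up to 8 preceding values, with the
# first two positions replaced by the median of the first four values.
roll_prev_median <- function(x, n_prev = 8) {
  out <- vapply(seq_along(x), function(i) {
    if (i == 1) return(NA_real_)
    as.numeric(median(x[max(1, i - n_prev):(i - 1)]))
  }, numeric(1))
  out[seq_len(min(2, length(x)))] <- median(head(x, 4), na.rm = TRUE)
  out
}

df <- data.frame(
  keywords = rep(c("drink", "food"), each = 6),
  hits     = rep(c(4, 5, 8), times = 4)
)

# Split hits by keyword, apply the function to each piece, then put the
# results back in the original row order.
pieces <- split(df$hits, df$keywords)
df$rollmedian <- unsplit(lapply(pieces, roll_prev_median), df$keywords)

# ave() does the same split/apply/recombine in one call.
via_ave <- ave(df$hits, df$keywords, FUN = roll_prev_median)
print(all.equal(df$rollmedian, via_ave))
print(df)
```

Because each piece is processed on its own, the first food rows never see values from drink, which is the problem the original loop had.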
Data
summary2 <- structure(list(keywords = c("drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "drink", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food ", "food "), hits = c(4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8, 4, 5, 8), date = c(" 01-01-2016", " 01-02-2016", " 01-03-2016", " 01-04-2016", " 01-05-2016", " 01-06-2016", " 01-07-2016", " 01-08-2016", " 01-09-2016", " 01-10-2016", " 01-11-2016", " 01-12-2016", "01-01-2016", "01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016", "01-06-2016", "01-07-2016", "01-08-2016", "01-09-2016", "01-10-2016", "01-11-2016", "01-12-2016")), class = "data.frame", row.names = c(NA, -24L))