In R, average row values until a specific condition is hit, then restart, with output in a new column
Problem description
I am working with GPS data and trying to figure out how to average the 11th to 15th fixes for latitude and longitude. I have seen solutions in similar questions for averaging every n rows. The problem is that the satellites occasionally bomb out and the fixes stop at 13 or 14, so in those cases I only want to average 3 or 4 values instead of 5. I am therefore looking to average latitude and longitude starting from where the number in series is 11 until the number in series drops again (or for as long as it keeps increasing? It needs to include the last set, which will not drop back down to a lower number). I have started by deleting all rows where the number in series is NOT in my desired 11-15 range. For an example dummy data set, this leaves me with:
Date Time Long Lat NoInSeries
12 17/11/2014 22:09:17 115.9508 -31.82850 11
13 17/11/2014 22:09:18 115.9508 -31.82846 12
14 17/11/2014 22:09:19 115.9513 -31.82864 13
15 17/11/2014 22:09:21 115.9511 -31.82863 14
26 18/11/2014 00:07:14 115.9509 -31.82829 11
27 18/11/2014 00:07:15 115.9509 -31.82829 12
28 18/11/2014 00:07:16 115.9509 -31.82830 13
29 18/11/2014 00:07:17 115.9509 -31.82830 14
30 18/11/2014 00:07:18 115.9509 -31.82831 15
56 18/11/2014 10:00:24 115.9513 -31.82670 11
57 18/11/2014 10:00:25 115.9514 -31.82670 12
58 18/11/2014 10:00:26 115.9514 -31.82669 13
59 18/11/2014 10:00:27 115.9514 -31.82668 14
60 18/11/2014 10:00:28 115.9514 -31.82668 15
My desired output would be something like this, with the first group averaging 4 fixes (11-14) and the next two averaging 5 fixes (11-15):
Date Time Long Lat NoInSeries AvgLong AvgLat
12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510 -31.82856
13 17/11/2014 22:09:18 115.9508 -31.82846 12 NA NA
14 17/11/2014 22:09:19 115.9513 -31.82864 13 NA NA
15 17/11/2014 22:09:21 115.9511 -31.82863 14 NA NA
26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509 -31.82830
27 18/11/2014 00:07:15 115.9509 -31.82829 12 NA NA
28 18/11/2014 00:07:16 115.9509 -31.82830 13 NA NA
29 18/11/2014 00:07:17 115.9509 -31.82830 14 NA NA
30 18/11/2014 00:07:18 115.9509 -31.82831 15 NA NA
56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514 -31.82669
57 18/11/2014 10:00:25 115.9514 -31.82670 12 NA NA
58 18/11/2014 10:00:26 115.9514 -31.82669 13 NA NA
59 18/11/2014 10:00:27 115.9514 -31.82668 14 NA NA
60 18/11/2014 10:00:28 115.9514 -31.82668 15 NA NA
I would then go through and delete all rows where AvgLong is NA, so my final output would contain only the rows where the number in series is 11, together with the averages.
I really don't know where to start with the code for this. The examples I have found all discuss averaging an exact number of rows, rather than a variable number.
For example:
c( tapply( x, (row(x)-1)%/%5, mean ) )
Or:
idx <- ceiling(seq_len(nrow(dd)) / 5)
# do colMeans on all columns except last one.
res <- lapply(split(dd[-(ncol(dd))], idx), colMeans, na.rm = TRUE)
# assign first value of "datetime" in each 5-er group as names to list
names(res) <- dd$datetime[seq(1, nrow(dd), by=5)]
# bind them to give a matrix
res <- do.call(rbind, res)
Also, the answers I have seen generally output the averages as a new data frame. Ultimately, I also want this averaging to depend on a condition: if the schedule is 'Multifix', I want to average from 11 up to however high it goes (at most 15), whereas if the schedule is 'Continuous', I want to average from 181 up to however high each series goes. Something like this:
if(import.list$Schedule=='Multifix'){
  # ...code to average Long and Lat for Number in Series from 11 up to however high it goes (up to 15)...
} else {
  # ...code to average Long and Lat for Number in Series from 241 up to however high it goes...
}
Or perhaps I could use an if/else statement to define a variable and then use that variable in the function that does the averaging?
...but I imagine this condition could complicate things if the output creates a new data frame, which is why I was aiming to just add values to new columns "AvgLong" and "AvgLat". Thanks for any help!
Recommended answer
# The dput() output below shows the data I was working with, taken from your question.
dput(df1)
structure(list(ID = c(12L, 13L, 14L, 15L, 26L, 27L, 28L, 29L,
30L, 56L, 57L, 58L, 59L, 60L), Date = c("17/11/2014", "17/11/2014",
"17/11/2014", "17/11/2014", "18/11/2014", "18/11/2014", "18/11/2014",
"18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014",
"18/11/2014", "18/11/2014"), Time = c("22:09:17", "22:09:18",
"22:09:19", "22:09:21", "00:07:14", "00:07:15", "00:07:16", "00:07:17",
"00:07:18", "10:00:24", "10:00:25", "10:00:26", "10:00:27", "10:00:28"
), Long = c(115.9508, 115.9508, 115.9513, 115.9511, 115.9509,
115.9509, 115.9509, 115.9509, 115.9509, 115.9513, 115.9514, 115.9514,
115.9514, 115.9514), Lat = c(-31.8285, -31.82846, -31.82864,
-31.82863, -31.82829, -31.82829, -31.8283, -31.8283, -31.82831,
-31.8267, -31.8267, -31.82669, -31.82668, -31.82668), NoInSeries = c(11L,
12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L, 11L, 12L, 13L, 14L, 15L
)), .Names = c("ID", "Date", "Time", "Long", "Lat", "NoInSeries"
), class = "data.frame", row.names = c(NA, -14L))
# get.counter returns the length of each run of ascending values, starting a new run whenever the column value stops increasing.
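get.counter <- function(x){
  a1 = x
  counter = 0
  a2 = c()
  # seq_along() avoids iterating over c(1, 0) when x is empty
  for(i in seq_along(a1)){
    if(i < length(a1)){
      if(a1[i+1] > a1[i]){
        # still ascending: extend the current run
        counter = counter + 1
      }else{
        # next value does not increase: close the current run
        counter = counter + 1
        a2 = c(a2, counter)
        counter = 0
      }
    }else{
      # last element: close the final run
      counter = counter + 1
      a2 = c(a2, counter)
    }
  }
  return(a2)
}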
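As a quick check of what `get.counter` computes, the sketch below derives the same run lengths without an explicit loop. The helper name `run_lengths` is hypothetical and not part of the original answer; like `get.counter`, it assumes a new run starts wherever the value fails to increase.

```r
# NoInSeries values from the example data in the question
no_in_series <- c(11L, 12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L,
                  11L, 12L, 13L, 14L, 15L)

# Hypothetical helper: lengths of consecutive strictly increasing runs.
run_lengths <- function(x) {
  if (length(x) == 0) return(integer(0))
  # run_id increases by one at every position where the value does not increase
  run_id <- cumsum(c(TRUE, diff(x) <= 0))
  # count how many elements fall into each run
  tabulate(run_id)
}

run_lengths(no_in_series)
# [1] 4 5 5
```

The three values correspond to the three groups of fixes in the example data (11-14, 11-15, 11-15).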
# avg.seg.col returns a one-column data frame holding the segmented average of a column: the mean on the first row of each segment and NA on the remaining rows. df1 is the input data frame, colvar is the column name (e.g. "Long" or "Lat"), and get_counter is the output of the get.counter function.
avg.seg.col <- function(df1, colvar, get_counter){
  long <- c()
  start = 1
  for(i in cumsum(get_counter)){
    end = i
    b1 = subset(df1, select = colvar)[start:end,]
    mean_b1 = mean(b1)
    long = c(long, mean_b1, rep(NA, (length(b1)-1)))
    start = end+1
  }
  return(data.frame(long, stringsAsFactors = FALSE))
}
# Read in the data from a text file using read.table(). Make sure the file exists in the current working directory, which can be set with setwd("path to working directory").
df1 <- read.table(file = "file1.txt",
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
# Apply get.counter to the df1$NoInSeries vector
get_counter <- get.counter(df1$NoInSeries)
# Apply avg.seg.col to the Long column
AvgLong <- avg.seg.col(df1, "Long", get_counter)
# Apply avg.seg.col to the Lat column
AvgLat <- avg.seg.col(df1, "Lat", get_counter)
# Combine the data frames column-wise
df2 <- do.call("cbind", list(df1, AvgLong, AvgLat))
# Assign the column names
colnames(df2) <- c(colnames(df2)[1:(ncol(df2)-2)], "AvgLong", "AvgLat")
Output:
print(df2)
ID Date Time Long Lat NoInSeries AvgLong AvgLat
1 12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510 -31.82856
2 13 17/11/2014 22:09:18 115.9508 -31.82846 12 NA NA
3 14 17/11/2014 22:09:19 115.9513 -31.82864 13 NA NA
4 15 17/11/2014 22:09:21 115.9511 -31.82863 14 NA NA
5 26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509 -31.82830
6 27 18/11/2014 00:07:15 115.9509 -31.82829 12 NA NA
7 28 18/11/2014 00:07:16 115.9509 -31.82830 13 NA NA
8 29 18/11/2014 00:07:17 115.9509 -31.82830 14 NA NA
9 30 18/11/2014 00:07:18 115.9509 -31.82831 15 NA NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514 -31.82669
11 57 18/11/2014 10:00:25 115.9514 -31.82670 12 NA NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669 13 NA NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668 14 NA NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668 15 NA NA
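The same AvgLong and AvgLat columns can also be sketched without the two helper functions, using only base R. This is an alternative to the approach above, not the original answer's method; it assumes a non-empty data frame, and the names `grp` and `is_first` are introduced here for illustration.

```r
# Example data from the question (only the columns needed here)
df1 <- data.frame(
  Long = c(115.9508, 115.9508, 115.9513, 115.9511, 115.9509, 115.9509,
           115.9509, 115.9509, 115.9509, 115.9513, 115.9514, 115.9514,
           115.9514, 115.9514),
  Lat = c(-31.8285, -31.82846, -31.82864, -31.82863, -31.82829, -31.82829,
          -31.8283, -31.8283, -31.82831, -31.8267, -31.8267, -31.82669,
          -31.82668, -31.82668),
  NoInSeries = c(11L, 12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L,
                 11L, 12L, 13L, 14L, 15L)
)

# Group id: increases by one each time NoInSeries fails to increase
grp <- cumsum(c(TRUE, diff(df1$NoInSeries) <= 0))

# TRUE on the first row of each group
is_first <- !duplicated(grp)

# ave() repeats the group mean on every row; keep it on the first row only
df1$AvgLong <- ifelse(is_first, ave(df1$Long, grp), NA)
df1$AvgLat  <- ifelse(is_first, ave(df1$Lat, grp), NA)

# Rows that carry the averages
df1[is_first, ]
```

Because the averages are written into new columns of the existing data frame, any other columns (such as Date and Time) stay attached to the rows.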
# After removing the rows where AvgLong is NA, the output looks like this
df2[!is.na(df2$AvgLong), ]
ID Date Time Long Lat NoInSeries AvgLong AvgLat
1 12 17/11/2014 22:09:17 115.9508 -31.82850 11 115.9510 -31.82856
5 26 18/11/2014 00:07:14 115.9509 -31.82829 11 115.9509 -31.82830
10 56 18/11/2014 10:00:24 115.9513 -31.82670 11 115.9514 -31.82669
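For the schedule-dependent case raised in the question, one option is to let an if/else choose the range of fix numbers and pass it to a single averaging function. The sketch below is an assumption-laden illustration: `avg_runs`, `schedule`, `limits` and the `demo` data are all hypothetical, and `schedule` is assumed to be a single string for the whole data set. The question's prose gives 181 as the starting fix for 'Continuous' while its pseudocode says 241, so the value is left as a parameter.

```r
# Hypothetical helper: average Long/Lat per run of increasing NoInSeries,
# keeping only fixes numbered from first_fix up to last_fix.
avg_runs <- function(df, first_fix, last_fix = Inf) {
  keep <- df[df$NoInSeries >= first_fix & df$NoInSeries <= last_fix, ]
  if (nrow(keep) == 0) return(keep)
  # new group wherever NoInSeries fails to increase
  grp <- cumsum(c(TRUE, diff(keep$NoInSeries) <= 0))
  # one output row per group: the first fix of that group
  out <- keep[!duplicated(grp), ]
  out$AvgLong <- as.vector(tapply(keep$Long, grp, mean))
  out$AvgLat  <- as.vector(tapply(keep$Lat, grp, mean))
  out
}

# Assumed: a single schedule label for the whole data set
schedule <- "Multifix"
limits <- if (schedule == "Multifix") {
  list(first = 11, last = 15)
} else {
  list(first = 181, last = Inf)  # start value taken from the question's prose
}

# Made-up input, including fixes outside the 11-15 window
demo <- data.frame(
  Long = c(1, 2, 3, 4, 10, 20, 30),
  Lat = c(-1, -2, -3, -4, -10, -20, -30),
  NoInSeries = c(10L, 11L, 12L, 13L, 10L, 11L, 12L)
)

avg_runs(demo, limits$first, limits$last)
```

With the made-up input, the two retained rows are the fixes numbered 11, with AvgLong values 3 and 25 and AvgLat values -3 and -25.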