In R, average row value until hit a specific condition, then restart, with output in new column

Question

I am working with GPS data and trying to figure out how to average the 11th-15th fixes for latitude and longitude. I have seen solutions in similar questions for how to average every n rows. The problem is that occasionally the satellites bomb out and the fixes stop at 13 or 14. In these cases, I only want to average 3 or 4 values instead of 5. So I am looking to average latitude and longitude starting from where the number in series is 11 until the number in series drops again (or for as long as it is increasing; I need it to include the last set, which will not drop back down to a low number again). I have started by deleting all rows where the number in series is NOT in my desired 11-15 range. For an example dummy data set, this leaves me with:

      Date      Time     Long       Lat     NoInSeries
12  17/11/2014 22:09:17 115.9508 -31.82850    11
13  17/11/2014 22:09:18 115.9508 -31.82846    12
14  17/11/2014 22:09:19 115.9513 -31.82864    13
15  17/11/2014 22:09:21 115.9511 -31.82863    14
26  18/11/2014 00:07:14 115.9509 -31.82829    11
27  18/11/2014 00:07:15 115.9509 -31.82829    12
28  18/11/2014 00:07:16 115.9509 -31.82830    13
29  18/11/2014 00:07:17 115.9509 -31.82830    14
30  18/11/2014 00:07:18 115.9509 -31.82831    15
56  18/11/2014 10:00:24 115.9513 -31.82670    11
57  18/11/2014 10:00:25 115.9514 -31.82670    12
58  18/11/2014 10:00:26 115.9514 -31.82669    13
59  18/11/2014 10:00:27 115.9514 -31.82668    14
60  18/11/2014 10:00:28 115.9514 -31.82668    15

My desired output would be something like this, with the first one averaging 4 (11-14) and the next two averaging 5 (11-15):

     Date      Time     Long       Lat     NoInSeries  AvgLong     AvgLat
12  17/11/2014 22:09:17 115.9508 -31.82850    11       115.9510   -31.82856
13  17/11/2014 22:09:18 115.9508 -31.82846    12          NA          NA
14  17/11/2014 22:09:19 115.9513 -31.82864    13          NA          NA
15  17/11/2014 22:09:21 115.9511 -31.82863    14          NA          NA
26  18/11/2014 00:07:14 115.9509 -31.82829    11       115.9509   -31.82830
27  18/11/2014 00:07:15 115.9509 -31.82829    12          NA          NA
28  18/11/2014 00:07:16 115.9509 -31.82830    13          NA          NA
29  18/11/2014 00:07:17 115.9509 -31.82830    14          NA          NA
30  18/11/2014 00:07:18 115.9509 -31.82831    15          NA          NA
56  18/11/2014 10:00:24 115.9513 -31.82670    11       115.9514   -31.82669
57  18/11/2014 10:00:25 115.9514 -31.82670    12          NA          NA
58  18/11/2014 10:00:26 115.9514 -31.82669    13          NA          NA
59  18/11/2014 10:00:27 115.9514 -31.82668    14          NA          NA
60  18/11/2014 10:00:28 115.9514 -31.82668    15          NA          NA

I would then go through and delete all rows where AvgLong==NA, so my final output would just have all the rows where number in series=11 with the averages.

I really don't know where to start with the code for this...the examples I have found have all discussed averaging an exact number of rows, rather than a variable number.

For example:

c( tapply( x, (row(x)-1)%/%5, mean ) )

Or:

idx <- ceiling(seq_len(nrow(dd)) / 5)
# do colMeans on all columns except last one.
res <- lapply(split(dd[-(ncol(dd))], idx), colMeans, na.rm = TRUE)
# assign first value of "datetime" in each 5-er group as names to list
names(res) <- dd$datetime[seq(1, nrow(dd), by=5)]
# bind them to give a matrix
res <- do.call(rbind, res)

Also, the answers I have seen generally output the averages as a new data frame. Ultimately, I also want this averaging to depend on a condition: if schedule is 'Multifix', I want to average from 11 up to however high it goes (at most 15), whereas if schedule is 'Continuous', I want to average from 181 up to however high each one goes. Something like this:

if(import.list$Schedule=='Multifix'){
  # ...code to average Long and Lat for Number in Series from 11 up to however high it goes (up to 15)...
} else {
  # ...code to average Long and Lat for Number in Series from 241 up to however high it goes...
}

Or perhaps I have an if else statement to define a variable and then use that variable in the function to do the averaging?

...but I imagine this condition could complicate things if the output creates a new dataframe, which is why I was aiming for just adding values to new columns "AvgLong" and "AvgLat." Thanks for any help!!

Recommended answer

# dput() shows the data I built from your question.

dput(df1)
structure(list(ID = c(12L, 13L, 14L, 15L, 26L, 27L, 28L, 29L, 
30L, 56L, 57L, 58L, 59L, 60L), Date = c("17/11/2014", "17/11/2014", 
"17/11/2014", "17/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", 
"18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", "18/11/2014", 
"18/11/2014", "18/11/2014"), Time = c("22:09:17", "22:09:18", 
"22:09:19", "22:09:21", "00:07:14", "00:07:15", "00:07:16", "00:07:17", 
"00:07:18", "10:00:24", "10:00:25", "10:00:26", "10:00:27", "10:00:28"
), Long = c(115.9508, 115.9508, 115.9513, 115.9511, 115.9509, 
115.9509, 115.9509, 115.9509, 115.9509, 115.9513, 115.9514, 115.9514, 
115.9514, 115.9514), Lat = c(-31.8285, -31.82846, -31.82864, 
-31.82863, -31.82829, -31.82829, -31.8283, -31.8283, -31.82831, 
-31.8267, -31.8267, -31.82669, -31.82668, -31.82668), NoInSeries = c(11L, 
12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L, 11L, 12L, 13L, 14L, 15L
)), .Names = c("ID", "Date", "Time", "Long", "Lat", "NoInSeries"
), class = "data.frame", row.names = c(NA, -14L))

# get.counter returns the length of each run of strictly increasing values in x. A run ends whenever the next value fails to increase, and the last run ends at the end of the vector.

get.counter <- function(x){
  a1 = x
  counter = 0
  a2 = c()
  for( i in 1:length(a1)){  
    if(i < length(a1)){
      if(a1[i+1] > a1[i]){
        counter = counter + 1
      }else{
        counter = counter + 1
        a2 = c(a2, counter)
        counter = 0
      }
    }else{
      counter = counter + 1
      a2 = c(a2, counter)
    }
  }
  return(a2)
}
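
The same run lengths can probably be obtained without an explicit loop. The following is only a sketch of an alternative, not part of the original answer: run.lengths is a hypothetical name, and it assumes, like get.counter, that a run ends whenever the next value fails to increase.

```r
# Hypothetical helper (not from the original answer): lengths of the runs of
# strictly increasing values in x, computed without an explicit loop.
run.lengths <- function(x) {
  if (length(x) == 0) return(integer(0))
  # A run starts at position 1 and wherever the value fails to increase.
  starts <- c(TRUE, diff(x) <= 0)
  # cumsum() turns the start flags into a run id; tabulate() counts rows per id.
  tabulate(cumsum(starts))
}

# NoInSeries values from the question's dummy data set.
no_in_series <- c(11, 12, 13, 14, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15)
run.lengths(no_in_series)
# [1] 4 5 5
```

For the question's data this gives one run of 4 fixes followed by two runs of 5, which is what get.counter returns as well.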

# avg.seg.col outputs a one-column data frame with the segmented average of a column: the mean sits on the first row of each segment and NA on the remaining rows. df1 is the input data frame, colvar is the column name (e.g. "Long" or "Lat"), and get_counter is the output of the get.counter function.

avg.seg.col <- function(df1, colvar, get_counter){ 

  long <- c()

  start = 1

  for(i in cumsum(get_counter)){
    end = i
    b1 = subset(df1, select = colvar)[start:end,]

    mean_b1 = mean(b1)

    long = c(long, mean_b1, rep(NA, (length(b1)-1)))

    start = end+1
  }
  return(data.frame(long, stringsAsFactors = FALSE))
}
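
As a possible alternative to the loop, the segmented averages could be written straight into new columns with base R's ave(). This is a sketch under assumptions, not the answer's method: df, grp, first and result are illustrative names, and the data are the first two series from the question.

```r
# Stand-in for df1: the first two series of fixes from the question.
df <- data.frame(
  Long = c(115.9508, 115.9508, 115.9513, 115.9511,
           115.9509, 115.9509, 115.9509, 115.9509, 115.9509),
  Lat = c(-31.82850, -31.82846, -31.82864, -31.82863,
          -31.82829, -31.82829, -31.82830, -31.82830, -31.82831),
  NoInSeries = c(11L, 12L, 13L, 14L, 11L, 12L, 13L, 14L, 15L)
)

# Run id: goes up by one each time NoInSeries fails to increase.
grp <- cumsum(c(TRUE, diff(df$NoInSeries) <= 0))
# TRUE on the first row of each run.
first <- !duplicated(grp)

# ave() repeats the group mean on every row; keep it on the first row only.
df$AvgLong <- ifelse(first, ave(df$Long, grp), NA)
df$AvgLat <- ifelse(first, ave(df$Lat, grp), NA)

# Drop the NA rows to keep one averaged row per series.
result <- df[!is.na(df$AvgLong), ]
```

Because the run id is derived from NoInSeries itself, a series that stops at 13 or 14 is averaged over however many fixes it has.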

# read in data from a text file using read.table function. You need to make sure your file exists in current working directory. Working directory can be set by setwd("path of current working directory")

df1 <- read.table(file = "file1.txt", 
                  header = TRUE, 
                  sep = "\t", 
                  stringsAsFactors = FALSE)

# apply get.counter function with a vector from df1$NoInSeries

get_counter <- get.counter(df1$NoInSeries)

# Apply avg.seg.col function for Long column

AvgLong <- avg.seg.col(df1, "Long", get_counter)

# Apply avg.seg.col function for Lat column

AvgLat <- avg.seg.col(df1, "Lat", get_counter)

# Bind the data frames column-wise

df2 <- do.call("cbind", list(df1, AvgLong, AvgLat))

# Assign column names

colnames(df2) <- c(colnames(df2)[1:(ncol(df2)-2)], "AvgLong", "AvgLat")

Output:

     print(df2)
   ID       Date     Time     Long       Lat NoInSeries  AvgLong    AvgLat
1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510 -31.82856
2  13 17/11/2014 22:09:18 115.9508 -31.82846         12       NA        NA
3  14 17/11/2014 22:09:19 115.9513 -31.82864         13       NA        NA
4  15 17/11/2014 22:09:21 115.9511 -31.82863         14       NA        NA
5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509 -31.82830
6  27 18/11/2014 00:07:15 115.9509 -31.82829         12       NA        NA
7  28 18/11/2014 00:07:16 115.9509 -31.82830         13       NA        NA
8  29 18/11/2014 00:07:17 115.9509 -31.82830         14       NA        NA
9  30 18/11/2014 00:07:18 115.9509 -31.82831         15       NA        NA
10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514 -31.82669
11 57 18/11/2014 10:00:25 115.9514 -31.82670         12       NA        NA
12 58 18/11/2014 10:00:26 115.9514 -31.82669         13       NA        NA
13 59 18/11/2014 10:00:27 115.9514 -31.82668         14       NA        NA
14 60 18/11/2014 10:00:28 115.9514 -31.82668         15       NA        NA

#after removing rows with NA, the output looks like below

df2[!is.na(df2$AvgLong), ]
   ID       Date     Time     Long       Lat NoInSeries  AvgLong    AvgLat
1  12 17/11/2014 22:09:17 115.9508 -31.82850         11 115.9510 -31.82856
5  26 18/11/2014 00:07:14 115.9509 -31.82829         11 115.9509 -31.82830
10 56 18/11/2014 10:00:24 115.9513 -31.82670         11 115.9514 -31.82669
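
For the schedule condition raised in the question, one option might be to let an if/else pick the range of series numbers and pass it to a single averaging function. The sketch below is only an illustration: avg.runs, schedule, continuous_start and demo are hypothetical names, the demo values are made up rather than real GPS data, and the starting number for 'Continuous' is left as a variable because the question mentions both 181 and 241.

```r
# Hypothetical helper: one averaged row per run, using only the rows whose
# NoInSeries lies between start_no and end_no.
avg.runs <- function(df, start_no, end_no = Inf) {
  keep <- df[df$NoInSeries >= start_no & df$NoInSeries <= end_no, ]
  if (nrow(keep) == 0) {
    keep$AvgLong <- numeric(0)
    keep$AvgLat <- numeric(0)
    return(keep)
  }
  # Run id: a new run begins wherever NoInSeries fails to increase.
  grp <- cumsum(c(TRUE, diff(keep$NoInSeries) <= 0))
  out <- keep[!duplicated(grp), ]
  out$AvgLong <- as.numeric(tapply(keep$Long, grp, mean))
  out$AvgLat <- as.numeric(tapply(keep$Lat, grp, mean))
  out
}

schedule <- "Multifix"   # stand-in for import.list$Schedule
continuous_start <- 181  # assumption: the question gives both 181 and 241

if (schedule == "Multifix") {
  start_no <- 11
  end_no <- 15
} else {
  start_no <- continuous_start
  end_no <- Inf
}

# Made-up demo values, chosen only to make the averages easy to check.
demo <- data.frame(
  Long = as.numeric(1:8),
  Lat = -as.numeric(1:8),
  NoInSeries = c(9L, 10L, 11L, 12L, 13L, 10L, 11L, 12L)
)
res <- avg.runs(demo, start_no, end_no)
```

Since the function returns only the first row of each run together with its averages, the step of deleting the NA rows afterwards is not needed in this variant.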
