Subsetting a data frame to include 20 rows before and after


Problem description

I know the title is kind of lame, but I couldn't think of anything else to call this. I'm trying to subset a large data frame using the values that appear in lon (the longitude column). My current subsetting script works: it creates a subset any time a -180 (the n/a value) appears, and each subset includes the first non -180 value before and after the run of one or more -180s. My problem is that I would like each subset to consist of the 20 longitudes before any run of -180s and the 20 after it. Since many of my files start and end with -180s, this creates an error. I have no idea how to tell the script to subset around -180s while ignoring any that appear in the first or last rows. Ideally the script would only subset -180s that have 20 longitudes before and 20 longitudes after them. Also, I will never know how many -180s appear at the start and end of a file, which has been the biggest obstacle to figuring this out myself. Below is a sample of my data and my current subsetting code. Thank you in advance for your help!

Edit: It's also very important that the rows stay in the same order and are not sorted in any way, as this is chronological data. My data frame has 4461 rows and 7 columns.

Edit: Below is a small sample of my data frame.

cols <- structure(list(fixType = structure(c(39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L), .Label = c(
    "firstfix +indoors +startpoint", "firstfix +indoors +startpoint +cluster_center",
    "firstfix +indoors +stationary", "firstfix +indoors +stationary +cluster_center",
    "firstfix +invehicle +startpoint", "firstfix +invehicle +startpoint +cluster_center",
    "firstfix +invehicle +stationary +cluster_center",
    "firstfix +outdoors +startpoint", "firstfix +outdoors +startpoint +cluster_center",
    "firstfix +outdoors +stationary", "firstfix +outdoors +stationary +cluster_center",
    "inserted +indoors +midpoint", "inserted +indoors +pausepoint",
    "inserted +indoors +stationary", "inserted +indoors +stationary +cluster_center",
    "inserted +invehicle +midpoint", "inserted +invehicle +pausepoint",
    "inserted +invehicle +stationary", "inserted +invehicle +stationary +cluster_center",
    "inserted +outdoors +midpoint", "inserted +outdoors +pausepoint",
    "inserted +outdoors +stationary", "inserted +outdoors +stationary +cluster_center",
    "lastfix +indoors +endpoint", "lastfix +indoors +endpoint +cluster_center",
    "lastfix +indoors +stationary", "lastfix +indoors +stationary +cluster_center",
    "lastfix +invehicle +endpoint", "lastfix +invehicle +endpoint +cluster_center",
    "lastfix +outdoors +endpoint", "lastfix +outdoors +endpoint +cluster_center",
    "lastfix +outdoors +stationary",
    "lastvalidfix +indoors +stationary", "lastvalidfix +indoors +stationary +cluster_center",
    "lastvalidfix +invehicle +stationary", "lastvalidfix +invehicle +stationary +cluster_center",
    "lastvalidfix +outdoors +stationary", "lastvalidfix +outdoors +stationary +cluster_center",
    "unknown",
    "valid +indoors +endpoint", "valid +indoors +endpoint +cluster_center",
    "valid +indoors +midpoint",
    "valid +indoors +pausepoint", "valid +indoors +pausepoint +cluster_center",
    "valid +indoors +startpoint", "valid +indoors +startpoint +cluster_center",
    "valid +indoors +stationary", "valid +indoors +stationary +cluster_center",
    "valid +invehicle +endpoint", "valid +invehicle +endpoint +cluster_center",
    "valid +invehicle +midpoint", "valid +invehicle +pausepoint",
    "valid +invehicle +startpoint", "valid +invehicle +startpoint +cluster_center",
    "valid +invehicle +stationary", "valid +invehicle +stationary +cluster_center",
    "valid +outdoors +endpoint", "valid +outdoors +endpoint +cluster_center",
    "valid +outdoors +midpoint",
    "valid +outdoors +pausepoint", "valid +outdoors +pausepoint +cluster_center",
    "valid +outdoors +startpoint", "valid +outdoors +startpoint +cluster_center",
    "valid +outdoors +stationary", "valid +outdoors +stationary +cluster_center"), class = "factor"),
  lon = c(-180, -180, -180, -180, -180, -180, -180, -180, -180, -180),
  lat = c(-180, -180, -180, -180, -180, -180, -180, -180, -180, -180),
  activityIntensity = c(2L, 2L, 1L, 2L, 2L, 2L, 0L, 2L, 1L, 0L),
  Impute = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4352L, 4353L, 4354L),
  subsetNum = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)),
  .Names = c("fixType", "lon", "lat", "activityIntensity", "Impute", "ID", "subsetNum"),
  row.names = c(4462L, 4463L, 4464L, 4465L, 4466L, 4467L, 4468L, 8813L, 8814L, 8815L),
  class = "data.frame")

Subsetting code:

set.seed(5)
n <- nrow(df) # number of rows in the input file (length(df) would count columns)

impCols <- df[ , c("fixType", "lon", "lat", "activityIntensity", "Impute", "ID", "subsetNum")]

test.df <- data.frame(impCols)

df <- test.df
obs <- dim(df)[1]

counter <- 1
subM.List <- list()

start.idx <- NA

for(i in 1:obs){
  if (is.na(start.idx) & (substr(df[i,"lon"], 1, 4) == -180)){
    start.idx <- i-1
  }
  else if (!is.na(start.idx) & (substr(df[i,"lon"], 1, 4) != -180)){
    end.idx <- i+1 #the plus one will give you the first two instances of signal instead of just the first
    subMat <- df[start.idx:end.idx,]
    subM.List[[counter]] <- subMat
    start.idx <- NA
    counter <- counter + 1
  }
}
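
The question does not show the error message, so the following is only a guess at one plausible cause: widening the window from 1 row to 20 can push the start index below 1 when a file begins with -180s. A minimal sketch on made-up data (the vector `lon`, the window size `S = 2`, and the names `first.gap`, `window`, and `clamped` are all hypothetical, chosen for illustration):

```r
# Hypothetical toy data: the track starts and ends with -180 (missing) values.
lon <- c(-180, -180, 10, 11, 12, -180, 13, 14, 15, -180)
S <- 2  # window size; the question uses 20

first.gap <- which(lon == -180)[1]          # row 1 here
window <- (first.gap - S):(first.gap + S)   # -1 0 1 2 3: negative and positive mixed

# R refuses to index with mixed negative and positive subscripts.
res <- tryCatch(lon[window], error = function(e) e)
inherits(res, "error")

# Clamping the window to the valid row range avoids the error,
# but the subset then has fewer than S rows of context before the gap.
clamped <- max(1, first.gap - S):min(length(lon), first.gap + S)
lon[clamped]
```

Clamping alone does not give the behaviour asked for, because a leading run of -180s still produces a subset. Such runs have to be skipped altogether.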


Recommended answer

indicators <-  df$lon == -180

# the first and last rows that are not -180 are your index boundaries
indx.min <- min(which(!indicators))    # will issue warning if lon is nothing but '-180'
indx.max <- max(which(!indicators))    # will issue warning if lon is nothing but '-180'

# From the question:
#   "My problem is that I would like the subsets to be comprised of
#    the 20 longitudes before any -180s, and 20 after"


# `inPlay` are the indicators that are not at the extreme ends
inPlay <- which(indicators)
inPlay <- inPlay[inPlay > indx.min & inPlay < indx.max]

# Sample Size
S <- 20  # use a variable so you can change it as needed

diffPlay <- diff(inPlay)
stop <- c(which(diffPlay !=1 ), length(inPlay))
start <- c(1,   which(diffPlay !=1 )+1)

# these are the row ranges of the runs of -180s.  You can have a look if you'd like
rbind(inPlay[start], inPlay[stop])

# we are going to take the 20 rows before each "start"
#   and the 20 rows after each "start" + "plus",
#   where "plus" is the length of the run minus one
inPlayPlus <- inPlay[stop] - inPlay[start]
inPlayStart <- inPlay[start]

## The names given to `inPlayStart` identify the -180 rows behind each subset
names(inPlayStart) <- ifelse(inPlayPlus > 0, paste0("Rows", inPlayStart, "_to_", inPlayStart+inPlayPlus), paste0("Row", inPlayStart))

subsetsList <- 
  lapply(seq_along(inPlayStart), function(i) {
      # This can be one line.  Broken up so you can see what's happening
      from <- max(indx.min, inPlayStart[[i]]-S)                    # max() clamps to the first non -180 row
      to   <- min(indx.max, inPlayStart[[i]] + inPlayPlus[[i]] +S) # min() clamps to the last non -180 row

      # progress output, for inspection only
      cat("i is ", i, "\tPlus=", inPlayPlus[[i]], "\t(from, to) = (", from, ", ", to, ")\tDIFF=", to-from, "\n", sep="")
      # for a run longer than one row, drop the run's -180 rows after its first
      indx <- if (inPlayPlus[[i]] == 0) from:to else setdiff(from:to, inPlayStart[[i]]+(1:inPlayPlus[[i]]) )
      df[indx, ]
    })


## Have a look at the results
subsetsList
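
The same idea can be written more compactly with `rle()`. This is a sketch of an alternative formulation, not the answer's code. The helper name `subset_around_runs`, its arguments, and the toy data frame `toy` are hypothetical. It assumes the column contains no `NA`s. Unlike the code above, it keeps every -180 row of the run inside each subset, and it skips (rather than clamps) any run that lacks `S` rows of context between the first and last valid rows. Other -180 runs that fall inside a window are not treated specially.

```r
# Hypothetical helper: one data frame per interior run of `na.value` in
# df[[col]], with exactly S rows of context on each side, in original row order.
subset_around_runs <- function(df, col = "lon", na.value = -180, S = 20) {
  is.gap <- df[[col]] == na.value
  valid  <- which(!is.gap)
  if (length(valid) == 0) return(list())   # column is nothing but na.value
  lo <- min(valid)                         # first row that is not na.value
  hi <- max(valid)                         # last row that is not na.value

  runs      <- rle(is.gap)                 # consecutive TRUE/FALSE stretches
  run.end   <- cumsum(runs$lengths)
  run.start <- run.end - runs$lengths + 1

  # Keep gap runs whose full window lies between the first and last valid rows;
  # this also drops the leading and trailing runs of na.value.
  keep <- runs$values & (run.start - S >= lo) & (run.end + S <= hi)

  Map(function(s, e) df[(s - S):(e + S), , drop = FALSE],
      run.start[keep], run.end[keep])
}

# Toy data with a leading gap, an interior gap, and a gap too close to the end.
toy <- data.frame(ID  = 1:12,
                  lon = c(-180, -180, 1, 2, 3, -180, -180, 4, 5, 6, -180, 7))
out <- subset_around_runs(toy, S = 2)
length(out)     # 1: only the interior run at rows 6-7 qualifies
out[[1]]$ID     # 4 5 6 7 8 9
```

With the real data the call would be `subset_around_runs(df, S = 20)`. If the -180 rows should be excluded from each subset, they can be filtered out afterwards, for example with `lapply(out, function(d) d[d$lon != -180, ])`.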
