Select rows before a filtered row using dplyr
Question
I'm working on a study where we used a camera placed inside a nest box to determine when our study species laid its first egg. Some of the cameras weren't super reliable, and I'd like to see if there were continuous photos before the date when the first egg was laid. This way I can know for sure that this is the first egg date. There are >165,000 photos and >200 nests, so I grouped by nest box ID, filtered the rows down to those that have at least 1 egg, and then used the slice function to select the first row with data. Here's a reproducible example:
example <- structure(list(boxID = c("CA10", "CA10", "CA10", "CA10", "CA10",
"CA10", "CA10", "CA10", "CA10", "CA10", "CA10", "CA10", "CA10",
"CA10", "CA10"), visitType = c("Image", "Image", "Image", "Image",
"Image", "Image", "Image", "Image", "Image", "Image", "Image",
"Image", "Image", "Image", "Image"), day = c(25L, 25L, 25L, 26L,
26L, 26L, 27L, 27L, 27L, 28L, 28L, 28L, 29L, 29L, 29L), month = c("MAR",
"MAR", "MAR", "MAR", "MAR", "MAR", "MAR", "MAR", "MAR", "MAR",
"MAR", "MAR", "MAR", "MAR", "MAR"), year = c(2018, 2018, 2018,
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,
2018), timeChecked = c("02:59", "09:06", "15:13", "02:59", "09:07",
"15:14", "02:59", "09:07", "15:13", "02:58", "09:06", "15:12",
"02:58", "09:06", "15:12"), species = c("Empty", "Empty", "Empty",
"Empty", "Empty", "Empty", "Empty", "Empty", "American Kestrel",
"Empty", "American Kestrel", "American Kestrel", "American Kestrel",
"American Kestrel", "American Kestrel"), sexAdult = c(NA, NA,
NA, NA, NA, NA, NA, NA, "Female", NA, "Female", "Female", "Female",
NA, NA), numEggs = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
"1", "1", "1", "1", "1"), numNestlings = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), date = structure(c(17615,
17615, 17615, 17616, 17616, 17616, 17617, 17617, 17617, 17618,
17618, 17618, 17619, 17619, 17619), class = "Date")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -15L), .Names = c("boxID",
"visitType", "day", "month", "year", "timeChecked", "species",
"sexAdult", "numEggs", "numNestlings", "date"))
Here's the code I have to find the first row that has at least 1 egg:
library(dplyr)

example %>%
  mutate_at(vars(numEggs, numNestlings), na_if, 'unknown') %>% # remove unknowns and other values that should be NA
  select(boxID, date, numEggs, visitType) %>%
  group_by(boxID) %>%
  filter(numEggs > 0) %>%
  slice(1)
I'd like to look at the 5 or 10 rows before this first row with an egg to make sure there was continuous data up to this point in time. Is there a way to do this row indexing with slice or some other dplyr function?
Recommended answer
Here's one way. match(TRUE, numEggs > 0) returns the position of the first row in each group where numEggs > 0, and slice then keeps that row together with the n_previous rows before it. We use max(1, ...) so that the start index never drops below 1 when the first row with an egg is within the first n_previous rows of a group, which would otherwise cause an error.
n_previous <- 5
example %>%
  mutate_at(vars(numEggs, numNestlings), na_if, 'unknown') %>%
  select(boxID, date, numEggs, visitType) %>%
  group_by(boxID) %>%
  slice(max(1, match(TRUE, numEggs > 0) - n_previous):match(TRUE, numEggs > 0))
# A tibble: 6 x 4
# Groups:   boxID [1]
  boxID date       numEggs visitType
  <chr> <date>     <chr>   <chr>
1 CA10  2018-03-26 <NA>    Image
2 CA10  2018-03-27 <NA>    Image
3 CA10  2018-03-27 <NA>    Image
4 CA10  2018-03-27 <NA>    Image
5 CA10  2018-03-28 <NA>    Image
6 CA10  2018-03-28 1       Image
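With 200+ nest boxes, two things might be worth guarding against: numEggs is stored as character in the example, so numEggs > 0 is a string comparison, and a box with no eggs at all makes match() return NA, which the slice() range cannot handle. Below is a minimal sketch of one possible variant, not a tested drop-in replacement. The toy tibble and the helper columns row_in_box and first_egg are made up for illustration, and it assumes numEggs holds only numeric strings or NA once the 'unknown' values have been cleaned out.

```r
library(dplyr)

n_previous <- 5

# Hypothetical toy data: box A has its first egg on row 3,
# box B on row 1, and box C never has an egg.
toy <- tibble(
  boxID   = rep(c("A", "B", "C"), each = 4),
  numEggs = c(NA, NA, "1", "1",
              "1", "1", "2", "2",
              NA, NA, NA, NA)
)

result <- toy %>%
  group_by(boxID) %>%
  mutate(
    row_in_box = row_number(),                        # position within the box
    first_egg  = match(TRUE, as.numeric(numEggs) > 0) # NA if the box has no eggs
  ) %>%
  # keep the first egg row and up to n_previous rows before it;
  # boxes without any egg are dropped instead of raising an error
  filter(!is.na(first_egg),
         row_in_box >= first_egg - n_previous,
         row_in_box <= first_egg) %>%
  ungroup()

result
```

Because the window is expressed as a filter on row positions rather than a slice() range, there is no need for max(1, ...): a start position below 1 simply matches no extra rows.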