How to select random non-consecutive dates for every grouped element?


Problem description

I am currently trying to select non-consecutive dates within each group of a grouping column.

In other words, I have the dataframe below (reproduced with dput() in the edit further down):

I would like to group_by(Site) and then keep only 3 random non-consecutive dates for each Site. For example, if HP37P1B has dates of 12th March, 13th March, 14th March and 7th March, I need a dataframe that only has (for example):

HP37P1B 12th March

HP37P1B 14th March

HP37P1B 7th March

So far I have tried the approaches from a number of Stack Overflow posts that use diff(), ave(), and the lubridate package, but I haven't had any success.

Edit

Based on the comments below, I am trying to make this question reproducible:

dput(uniqueSiteDate)

structure(list(Site = c("HP37P1B", "HP37P2B", "HP37P4B", "HP4008U", 
"INME03R", "INME03U", "INOA03R", "IPTO04R", "IPTO04U", "IPTO06R", 
"IPTO06U", "OLCAP2B", "OLCAP3B", "OLCAP5B", "PANMP1B", "PANMP2B", 
"PANMP3B", "STIN02R", "STIN02U", "UPMAP1B", "UPMAP3B", "UPMAP4B", 
"UPMAP5B", "UPMAP6B", "VAR210R", "VAR310R", "VAR310U", "VAR410R", 
"VAR410U", "HP36P1B", "HP36P3B", "HP36P4B", "HP4008R", "INBS04R", 
"INBS04U", "SEL107R", "SEL107U", "SEL207R", "SEL207U", "OLV110R", 
"OLV110U", "OLV208R", "OLV208U", "THEN10U", "HP37P1B", "HP37P2B", 
"HP37P4B", "HP4008U", "INME03R", "INME03U", "INOA03R", "IPTO04R", 
"IPTO04U", "IPTO06R", "IPTO06U", "OLCAP2B", "OLCAP3B", "OLCAP5B", 
"PANMP1B", "PANMP2B", "PANMP3B", "STIN02R", "STIN02U", "UPMAP1B", 
"UPMAP3B", "UPMAP4B", "UPMAP5B", "UPMAP6B", "VAR210R", "VAR310R", 
"VAR310U", "VAR410R", "VAR410U", "OLV110R", "OLV110U", "OLV208R", 
"OLV208U", "THEN10U", "HP37P1B", "HP37P2B", "HP37P4B", "HP4008U", 
"INME03R", "INME03U", "INOA03R", "IPTO04R", "IPTO04U", "IPTO06R", 
"IPTO06U", "OLCAP2B", "OLCAP3B", "OLCAP5B", "PANMP1B", "PANMP2B", 
"PANMP3B", "STIN02R", "STIN02U", "UPMAP1B", "UPMAP3B", "UPMAP4B", 
"UPMAP5B", "UPMAP6B", "VAR210R", "VAR310R", "VAR310U", "VAR410R", 
"VAR410U", "OLV110R", "OLV110U", "OLV208R", "OLV208U", "THEN10U", 
"HP37P1B", "HP37P2B", "HP37P4B", "HP4008U", "INME03R", "INME03U", 
"INOA03R", "IPTO04R", "IPTO04U", "IPTO06R", "IPTO06U", "OLCAP2B", 
"OLCAP3B"), Date = structure(c(18333, 18333, 18333, 18333, 18335, 
18335, 18335, 18338, 18335, 18338, 18335, 18333, 18333, 18333, 
18334, 18334, 18334, 18331, 18331, 18331, 18330, 18330, 18330, 
18330, 18332, 18332, 18332, 18332, 18332, 18325, 18325, 18325, 
18325, 18327, 18327, 18327, 18327, 18327, 18328, 18340, 18340, 
18340, 18340, 18340, 18334, 18334, 18334, 18334, 18336, 18336, 
18336, 18339, 18336, 18340, 18336, 18335, 18334, 18334, 18335, 
18335, 18335, 18332, 18332, 18332, 18331, 18331, 18331, 18331, 
18333, 18333, 18333, 18333, 18333, 18341, 18341, 18341, 18341,
18341, 18335, 18335, 18335, 18335, 18383, 18383, 18383, 18384, 
18384, 18384, 18384, 18385, 18385, 18335, 18342, 18342, 18341, 
18383, 18383, 18345, 18349, 18349, 18349, 18349, 18340, 18339, 
18340, 18341, 18339, 18386, 18386, 18348, 18346, 18347, 18328, 
18328, 18328, 18328, 18390, 18389, 18391, 18392, 18392, 18392, 
18392, 18392, 18392), class = "Date")), row.names = c(NA, -125L
), groups = structure(list(Site = c("HP36P1B", "HP36P3B", "HP36P4B", 
"HP37P1B", "HP37P2B", "HP37P4B", "HP4008R", "HP4008U", "INBS04R", 
"INBS04U", "INME03R", "INME03U", "INOA03R", "IPTO04R", "IPTO04U", 
"IPTO06R", "IPTO06U", "OLCAP2B", "OLCAP3B", "OLCAP5B", "OLV110R", 
"OLV110U", "OLV208R", "OLV208U", "PANMP1B", "PANMP2B", "PANMP3B", 
"SEL107R", "SEL107U", "SEL207R", "SEL207U", "STIN02R", "STIN02U", 
"THEN10U", "UPMAP1B", "UPMAP3B", "UPMAP4B", "UPMAP5B", "UPMAP6B", 
"VAR210R", "VAR310R", "VAR310U", "VAR410R", "VAR410U"), .rows = structure(list(
    30L, 31L, 32L, c(1L, 45L, 79L, 113L), c(2L, 46L, 80L, 114L
    ), c(3L, 47L, 81L, 115L), 33L, c(4L, 48L, 82L, 116L), 34L, 
    35L, c(5L, 49L, 83L, 117L), c(6L, 50L, 84L, 118L), c(7L, 
    51L, 85L, 119L), c(8L, 52L, 86L, 120L), c(9L, 53L, 87L, 121L
    ), c(10L, 54L, 88L, 122L), c(11L, 55L, 89L, 123L), c(12L, 
    56L, 90L, 124L), c(13L, 57L, 91L, 125L), c(14L, 58L, 92L), 
    c(40L, 74L, 108L), c(41L, 75L, 109L), c(42L, 76L, 110L), 
    c(43L, 77L, 111L), c(15L, 59L, 93L), c(16L, 60L, 94L), c(17L, 
    61L, 95L), 36L, 37L, 38L, 39L, c(18L, 62L, 96L), c(19L, 63L, 
    97L), c(44L, 78L, 112L), c(20L, 64L, 98L), c(21L, 65L, 99L
    ), c(22L, 66L, 100L), c(23L, 67L, 101L), c(24L, 68L, 102L
    ), c(25L, 69L, 103L), c(26L, 70L, 104L), c(27L, 71L, 105L
    ), c(28L, 72L, 106L), c(29L, 73L, 107L)), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -44L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

To answer other questions: sometimes there are more than 3 dates per site, but sometimes there is just 1 date per site. The idea is to choose n non-consecutive dates for a given Site. In other words, if a particular site has 4 dates, I need 3 non-consecutive ones. If a particular site has only 1 date, let's just keep it.

Recommended answer

Please check whether this serves the purpose. Selecting the maximum possible number of dates under the given criteria is difficult (at least for me). We can split each site's dates into runs of consecutive days with the strategy below. But consider a run of, say, 3 consecutive dates: if a random sample takes 2 dates from it, those two may or may not be consecutive. If we instead always picked the odd rows (2 dates) or the even row (1 date), the sample would be judgmental rather than random, in my opinion. So this is the strategy adopted:

  • split the data into groups, one per Site
  • carry out the operations on each group separately through purrr::map_df, which row-binds the results at the end
  • within each group, divide the dates into runs of consecutive days (each run becomes its own sub-group; an isolated date is a run of length 1), and select one random row from each run
  • finally, select three of these rows at random (or fewer, if the group has fewer than three runs)
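library(tidyverse)

# df is the data frame from the question's dput() (called uniqueSiteDate there)
df %>% 
  ungroup() %>%                      # the dput is a grouped_df; drop the grouping
  group_split(Site) %>%              # one tibble per Site
  map_df(., ~ .x %>% ungroup() %>%
        arrange(Date) %>%
        mutate(n = 1) %>%            # marker column: 1 for every observed date
        # fill in the missing calendar days; the filled rows get n = NA
        complete(Date = seq.Date(first(Date), last(Date), by = 'days')) %>%
        # every gap day increments the counter, so each run of consecutive
        # observed dates ends up with its own group id
        group_by(n = cumsum(is.na(n))) %>%
        filter(!is.na(Site)) %>%     # drop the filled-in gap days
        sample_n(1) %>%              # one random date per run
        ungroup() %>%
        sample_n(min(n(), 3))) %>%   # up to 3 of those dates per Site
  select(-n)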

# A tibble: 86 x 2
   Date       Site   
   <date>     <chr>  
 1 2020-03-04 HP36P1B
 2 2020-03-04 HP36P3B
 3 2020-03-04 HP36P4B
 4 2020-03-07 HP37P1B
 5 2020-03-12 HP37P1B
 6 2020-03-07 HP37P2B
 7 2020-03-12 HP37P2B
 8 2020-03-07 HP37P4B
 9 2020-03-12 HP37P4B
10 2020-03-04 HP4008R
# ... with 76 more rows
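
The run-splitting step can also be seen in isolation. The following is a minimal base R sketch, not part of the original answer, that uses the four HP37P1B dates from the question; the variable names (dates, run_id, one_per_run, picked) are chosen here for illustration only. It assumes the dates are unique within a site.

```r
# Toy dates for one site, taken from the question's HP37P1B example
dates <- sort(as.Date(c("2020-03-12", "2020-03-13", "2020-03-14", "2020-03-07")))

# A new run starts wherever the gap to the previous date exceeds one day
run_id <- cumsum(c(TRUE, diff(as.numeric(dates)) > 1))
# run_id is 1 2 2 2: 7 March alone, then 12-14 March together

# Pick the index of one random date per run. Runs are separated by at least
# one missing day, so dates from different runs can never be consecutive.
one_per_run <- vapply(
  split(seq_along(dates), run_id),
  function(i) i[sample.int(length(i), 1)],
  integer(1)
)
picked <- dates[one_per_run]
```

Here picked always contains 7 March plus exactly one of 12, 13 or 14 March, which shows why this strategy can return fewer dates than the maximum possible.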

Note: Your dput was grouped, so I had to add ungroup() as the second line of the code; you may remove it if your data is not grouped.
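
If picking at most one date per run is too restrictive (a run of 3 consecutive days could supply 2 non-consecutive dates), one possible alternative is a random greedy selection. This is a different technique from the answer above, sketched here under assumptions: pick_nonconsecutive() is a hypothetical helper, and its result is random but not uniformly distributed over all valid subsets.

```r
# Hypothetical helper (not from the original answer): visit the dates in
# random order and keep a date only if it is not adjacent to one already kept.
pick_nonconsecutive <- function(dates, k = 3) {
  dates <- sort(unique(dates))
  kept <- dates[0]                          # empty Date vector
  for (i in sample.int(length(dates))) {    # random visiting order
    d <- dates[i]
    # keep d only if it is at least 2 days away from every kept date
    if (all(abs(as.numeric(d) - as.numeric(kept)) > 1)) {
      kept <- c(kept, d)
    }
    if (length(kept) == k) break            # stop once k dates are chosen
  }
  sort(kept)
}

# Usage on a small toy data frame built from dates in the question
toy <- data.frame(
  Site = c(rep("HP37P1B", 4), "HP36P1B"),
  Date = as.Date(c("2020-03-12", "2020-03-13", "2020-03-14",
                   "2020-03-07", "2020-03-04"))
)
picked_by_site <- lapply(split(toy$Date, toy$Site), pick_nonconsecutive)
```

A site with a single date keeps that date, and a site with fewer than k non-adjacent dates returns as many as the random order allows.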
