How to randomly remove rows in a dataframe, but for a specific subgroup only (with dplyr::sample_n?)


Question

One column of my data frame holds several categories. I want to randomly thin out (remove) some rows, but only within one category. I've seen sample_n used together with group_by, but its size argument keeps the same number of rows in every group of the grouping variable. I want to specify a different size for each group.

Second, I'd like to do this "in place": the result should be the same original dataframe, except that it now has fewer rows in the specific category I wanted to "dilute".

library(tidyverse)

set.seed(123)

df <-
  tibble(
    color = sample(c("red", "blue", "yellow", "green", "brown"), size = 1000, replace = TRUE),
    value = sample(0:750, size = 1000, replace = TRUE)
  )

df

## # A tibble: 1,000 x 2
##    color  value
##    <chr>  <int>
##  1 yellow   251
##  2 yellow   389
##  3 blue     742
##  4 blue     227
##  5 yellow   505
##  6 brown     47
##  7 green    381
##  8 red      667
##  9 blue     195
## 10 yellow   680
## # ... with 990 more rows

Tallying by color gives:

df %>% count(color)

  color      n
  <chr>  <int>
1 blue     204
2 brown    202
3 green    191
4 red      203
5 yellow   200

Now say I want to reduce the number of rows for the red color only, keeping just 10 rows where color == "red". Simply using sample_n doesn't get me there, because it samples 10 rows from every group:

df %>%
  group_by(color) %>%
  sample_n(10) %>%
  count(color)

  color      n
  <chr>  <int>
1 blue      10
2 brown     10
3 green     10
4 red       10
5 yellow    10

How can I specify that only color == "red" is reduced to 10 rows while the other colors remain untouched?

I've seen some similar questions (like this one), but wasn't able to adapt the answers to my case.

Recommended answer

We can write a function that filters the specified colors, samples n rows from each of them, and binds the result to the rows of the original data that belong to the other colors:

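library(dplyr)  # slice_sample() requires dplyr >= 1.0.0

# Downsample only the colors listed in `col_to_change` to `n` rows each;
# rows of all other colors are returned unchanged.
# Note: the sampled rows come first in the result, so the original
# row order is not preserved.
sample_for_color <- function(data, col_to_change, n) {
  data %>%
    filter(color %in% col_to_change) %>%
    group_by(color) %>%
    slice_sample(n = n) %>%
    ungroup() %>%
    bind_rows(data %>% filter(!color %in% col_to_change))
}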

new_df <- df %>% sample_for_color('red', 10)
new_df %>% count(color)

#  color      n
#  <chr>  <int>
#1 blue     204
#2 brown    202
#3 green    191
#4 red       10
#5 yellow   200

new_df <- df %>% sample_for_color(c('red', 'blue'), 10)
new_df %>% count(color)

#  color      n
#  <chr>  <int>
#1 blue      10
#2 brown    202
#3 green    191
#4 red       10
#5 yellow   200
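
The question also asks for a different size per group and for the data to stay "in place". The following is only a sketch of one possible way to get both, not part of the original answer: the helper name `thin_groups`, its `sizes` argument (a named vector mapping a color to the number of rows to keep), and the temporary columns `.rank` and `.cap` are all hypothetical names introduced here for illustration. It assigns each row a random rank within its color and keeps a row only if its color has no cap or its rank is within the cap, so the surviving rows keep their original order.

```r
library(dplyr)

set.seed(123)

# Same kind of example data as in the question.
df <- tibble(
  color = sample(c("red", "blue", "yellow", "green", "brown"), size = 1000, replace = TRUE),
  value = sample(0:750, size = 1000, replace = TRUE)
)

# Hypothetical helper (not from the original answer).
# `sizes` is a named vector, e.g. c(red = 10, blue = 25); colors that are
# not named in `sizes` are left untouched.
thin_groups <- function(data, sizes) {
  data %>%
    group_by(color) %>%
    mutate(.rank = sample(n())) %>%           # random permutation of 1..group size
    ungroup() %>%
    mutate(.cap = unname(sizes[color])) %>%   # NA when the color has no cap
    filter(is.na(.cap) | .rank <= .cap) %>%   # keep uncapped rows and the first `.cap` ranks
    select(-.rank, -.cap)
}

new_df <- df %>% thin_groups(c(red = 10, blue = 25))
new_df %>% count(color)
```

With this call, red should end up with 10 rows and blue with 25, while the other colors keep their original counts. Because rows are filtered rather than split and re-bound, the remaining rows stay in the order they had in df.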
