Modify certain values in a data frame by indirect reference to the columns


Problem description

I'm wrangling some data where we sort fails into bins and compute limited yields for each sort bin by lot.

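I have a meta table describing the sort bins. The rows are arranged in ascending test order, and some of the sort labels have non-syntactic names.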

sort_tbl <- tibble::tribble(~weight,   ~label,
                                  0, "fail A",
                                  0, "fail B",
                                  0, "fail C",
                                100,   "pass")
> sort_tbl
# A tibble: 4 x 2
  weight  label
   <dbl>  <chr>
1      0 fail A
2      0 fail B
3      0 fail C
4    100   pass

I have a data table of limited yield by sort bin, with one row per lot and one column for each sort bin. Because this table was constructed from a transposition, we get instances where a particular sort never occurred for a lot and the resulting value is NA. Note that the columns in this table are arranged in descending test order.

yld_tbl <- tibble::tribble(  ~lot, ~pass, ~`fail C`, ~`fail B`, ~`fail A`,
                           "lot1",    NA,        NA,      0.00,        NA,
                           "lot2",    NA,      0.00,      0.80,        NA,
                           "lot3",  0.49,        NA,      0.50,      0.98,
                           "lot4",  0.70,      0.95,      0.74,      0.99)
> yld_tbl
# A tibble: 4 x 5
    lot  pass `fail C` `fail B` `fail A`
  <chr> <dbl>    <dbl>    <dbl>    <dbl>
1  lot1    NA       NA     0.00       NA
2  lot2    NA     0.00     0.80       NA
3  lot3  0.49       NA     0.50     0.98
4  lot4  0.70     0.95     0.74     0.99

Some of the missing values imply a limited yield of 100%, while others reflect an undefined value because the yield was already zero earlier in the flow. My task is to replace the former group of NAs with 1.00 as appropriate.

One algorithm to accomplish this works left to right (descending test order), replacing NA with 1.00 if the subsequent limited yield (the column to its left) is not NA. In the first row of the example data set, we don't change fail C since pass is missing. But we do replace fail A with 1.00 since fail B is not missing.

The correct example output is:

> fill_ones(yld_tbl, sort_tbl)
# A tibble: 4 x 5
    lot  pass `fail C` `fail B` `fail A`
  <chr> <dbl>    <dbl>    <dbl>    <dbl>
1  lot1    NA       NA     0.00     1.00
2  lot2    NA     0.00     0.80     1.00
3  lot3  0.49     1.00     0.50     0.98
4  lot4  0.70     0.95     0.74     0.99

Recommended answer

This problem becomes a bit easier if you think of it as "first replace all the NAs with 1, then replace all 1s after the first 0 with NA."
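
To make that reframing concrete, here is a minimal base R sketch applied to a single lot (lot1 from the question), with the bins listed in ascending test order. The variable names `x` and `seen_zero` are illustrative only and do not come from the original answer.

```r
# One lot's limited yields in ascending test order (lot1 from the question)
x <- c(`fail A` = NA, `fail B` = 0.00, `fail C` = NA, pass = NA)

# Step 1: treat every missing value as a 100% limited yield
x[is.na(x)] <- 1

# Step 2: from the first zero onward, any 1 is really an undefined value
seen_zero <- cumsum(x == 0) > 0
x[seen_zero & x == 1] <- NA

x
# fail A = 1, fail B = 0, fail C = NA, pass = NA
```

Only `fail A` ends up as 1.00, which matches the first row of the expected output.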

Here are two approaches, one using matrix operations and one using dplyr.

In the matrix approach, you'd extract the values as a numeric matrix, use apply to find the positions that need to be replaced with NA, and write the result back into the table.

# extract as a matrix, with left-to-right bins
m <- as.matrix(yld_tbl[, sort_tbl$label])

# replace NAs with 1
m[is.na(m)] <- 1

# find 1s happening after a zero in each row
after_zero <- t(apply(m == 0, 1, cumsum)) & (m == 1)

# replace them with NA
m[after_zero] <- NA

# return them in the table
yld_tbl[, sort_tbl$label] <- m
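
The steps above could be wrapped into a function with the same shape as the `fill_ones(yld_tbl, sort_tbl)` call shown in the question. This is only a sketch: the argument names `yld` and `sorts` are assumptions, the example data is rebuilt as plain data frames so the block runs without extra packages, and an explicit `> 0` is added to the cumulative sum for readability.

```r
# Hypothetical helper named after the call in the question.
# yld:   data frame with one column per sort bin
# sorts: meta table whose `label` column lists the bins in ascending test order
fill_ones <- function(yld, sorts) {
  m <- as.matrix(yld[, sorts$label])            # bins in ascending test order
  m[is.na(m)] <- 1                              # assume 100% wherever missing
  after_zero <- t(apply(m == 0, 1, cumsum)) > 0 & (m == 1)
  m[after_zero] <- NA                           # undo the fill at/after a zero
  yld[, sorts$label] <- m                       # original column order is kept
  yld
}

# Example data from the question, as base data frames
sort_tbl <- data.frame(weight = c(0, 0, 0, 100),
                       label  = c("fail A", "fail B", "fail C", "pass"),
                       stringsAsFactors = FALSE)
yld_tbl <- data.frame(lot      = paste0("lot", 1:4),
                      pass     = c(NA, NA, 0.49, 0.70),
                      `fail C` = c(NA, 0.00, NA, 0.95),
                      `fail B` = c(0.00, 0.80, 0.50, 0.74),
                      `fail A` = c(NA, NA, 0.98, 0.99),
                      check.names = FALSE, stringsAsFactors = FALSE)

res <- fill_ones(yld_tbl, sort_tbl)
res
```

One caveat of this reframing: a genuine 1.00 that appears after a zero in the same row would also be turned into NA, so the sketch assumes that case does not occur in the data.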


Using dplyr/tidyr, you'd first gather() the columns (using arrange() to put them in the desired order), replace the NAs (the group_by/mutate is accomplishing the same thing as apply above), and spread them back into a wide format.

library(dplyr)
library(tidyr)

yld_tbl %>%
  gather(label, value, -lot) %>%
  arrange(lot, match(label, sort_tbl$label)) %>%
  replace_na(list(value = 1)) %>%
  group_by(lot) %>%
  mutate(value = ifelse(cumsum(value == 0) > 0 & value == 1, NA, value)) %>%
  spread(label, value)

Note that unlike the matrix-based approach, this does not preserve the ordering of the columns.
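
If the original column order matters, one option is to reselect the columns by name at the end of the pipeline. The sketch below is a variation on the pipeline above, not the answer's own code: it assumes tidyr 1.0 or later and dplyr 1.0 or later, and swaps in `pivot_longer()`/`pivot_wider()` for `gather()`/`spread()` and `if_else()` for `ifelse()`. The name `filled` is illustrative.

```r
library(dplyr)
library(tidyr)

# Example data from the question
sort_tbl <- tibble::tribble(~weight,   ~label,
                                  0, "fail A",
                                  0, "fail B",
                                  0, "fail C",
                                100,   "pass")
yld_tbl <- tibble::tribble(  ~lot, ~pass, ~`fail C`, ~`fail B`, ~`fail A`,
                           "lot1",    NA,        NA,      0.00,        NA,
                           "lot2",    NA,      0.00,      0.80,        NA,
                           "lot3",  0.49,        NA,      0.50,      0.98,
                           "lot4",  0.70,      0.95,      0.74,      0.99)

filled <- yld_tbl %>%
  pivot_longer(-lot, names_to = "label", values_to = "value") %>%
  arrange(lot, match(label, sort_tbl$label)) %>%   # ascending test order
  replace_na(list(value = 1)) %>%
  group_by(lot) %>%
  mutate(value = if_else(cumsum(value == 0) > 0 & value == 1,
                         NA_real_, value)) %>%
  ungroup() %>%                                    # drop the per-lot grouping
  pivot_wider(names_from = label, values_from = value) %>%
  select(all_of(names(yld_tbl)))                   # restore original column order

filled
```

The final `select()` is what restores the `lot, pass, fail C, fail B, fail A` layout; the `ungroup()` call simply avoids returning a grouped tibble.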
