R dplyr 替换“缺失"具有第一个非“缺失"的列数据价值 [英] R dplyr replace "missing" column data with first non-"missing" value
问题描述
在标题(或谷歌)中简洁地描述这是一个棘手的问题.我有一个分类表,其中某些列可能会根据置信水平列为已删除".我想替换任何写着dropped"的列.与身份不明"后跟第一列中没有显示已删除"的值,以行方式.所以,输入看起来像这样:
This is a tricky one to describe concisely in a headline (or to google). I have a taxonomy table where some columns may be listed as "dropped' based on a confidence level. I'd like to replace any column that says "dropped" with "Unidentified" followed by the value from the first column that doesn't say "dropped", in a row-wise fashion. So, the input would look like this:
#> # A tibble: 21 x 4
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota dropped dropped dropped
#> 2 Eukaryota dropped dropped dropped
#> 3 Eukaryota dropped dropped dropped
#> 4 Eukaryota dropped dropped dropped
#> 5 Eukaryota dropped dropped dropped
#> 6 Eukaryota dropped dropped dropped
#> 7 Eukaryota Hexanauplia Calanoida dropped
#> 8 Eukaryota dropped dropped dropped
#> 9 Eukaryota Dinophyceae Syndiniales dropped
#> 10 Animals Polychaeta Terebellida dropped
#> 11 Eukaryota Acantharia Chaunacanthida dropped
#> 12 Eukaryota dropped dropped dropped
#> 13 Animals Ascidiacea Stolidobranchia dropped
#> 14 Eukaryota Haptophyta dropped dropped
#> 15 Eukaryota dropped dropped dropped
#> 16 Eukaryota dropped dropped dropped
#> 17 Eukaryota dropped dropped dropped
#> 18 Animals Ascidiacea Stolidobranchia dropped
#> 19 Eukaryota dropped dropped dropped
#> 20 Eukaryota dropped dropped dropped
输出应该是这样的:
#> # A tibble: 21 x 4
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 2 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 3 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 4 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 5 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 6 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida Unidentified Calanoida
#> 8 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales Unidentified Syndiniales
#> 10 Animals Polychaeta Terebellida Unidentified Terebellida
#> 11 Eukaryota Acantharia Chaunacanthida Unidentified Chaunacanth…
#> 12 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 13 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 14 Eukaryota Haptophyta Unidentified Haptop… Unidentified Haptophyta
#> 15 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 16 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 17 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 18 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 19 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 20 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
我使用 purrr::pmap_dfr
提出了一个很好的解决方案,但我很想知道是否有更纯"的解决方案.dplyr
怎么做?我的方法的一个缺陷是它不适用于第一个非丢弃"的列.列在一个或多个丢弃"之后列(请参阅下面输出中的第 21 行).这是我目前的解决方案:
I've come up with a fine solution using purrr::pmap_dfr
but I'm curious to know if there's a more "pure" dplyr
way to do it? The one flaw in my method is that it doesn't work for columns where the first non-"dropped" column comes after one or more "dropped" columns (see row 21 in the output below). Here's my current solution:
library(tidyverse)
otu_table <- structure(list(domain = c("Eukaryota", "Eukaryota", "Eukaryota",
"Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota",
"Eukaryota", "Animals", "Eukaryota", "Eukaryota", "Animals",
"Eukaryota", "Eukaryota", "Eukaryota", "Eukaryota", "Animals",
"Eukaryota", "Eukaryota", "dropped"), class = c("dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "Hexanauplia", "dropped",
"Dinophyceae", "Polychaeta", "Acantharia", "dropped", "Ascidiacea",
"Haptophyta", "dropped", "dropped", "dropped", "Ascidiacea",
"dropped", "dropped", "not dropped"), order = c("dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "Calanoida", "dropped",
"Syndiniales", "Terebellida", "Chaunacanthida", "dropped", "Stolidobranchia",
"dropped", "dropped", "dropped", "dropped", "Stolidobranchia",
"dropped", "dropped", "dropped"), species = c("dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "dropped", "dropped",
"dropped", "dropped", "dropped", "dropped", "dropped", "dropped",
"dropped")), row.names = c(NA, -21L), class = c("tbl_df", "tbl",
"data.frame"))
tax_data <- otu_table %>%
pmap_dfr(~{
items <- list(...)
first_dropped = match("dropped",items)
if (first_dropped > 1) {
dropped_name <- str_c("Unidentified ",items[first_dropped-1])
} else {
dropped_name <- "Unidentified"
}
items[-c(1:first_dropped-1)] <- dropped_name
items
})
print(tax_data,n=30)
#> # A tibble: 21 x 4
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 2 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 3 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 4 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 5 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 6 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida Unidentified Calanoida
#> 8 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales Unidentified Syndiniales
#> 10 Animals Polychaeta Terebellida Unidentified Terebellida
#> 11 Eukaryota Acantharia Chaunacanthida Unidentified Chaunacanth…
#> 12 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 13 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 14 Eukaryota Haptophyta Unidentified Haptop… Unidentified Haptophyta
#> 15 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 16 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 17 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 18 Animals Ascidiacea Stolidobranchia Unidentified Stolidobran…
#> 19 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 20 Eukaryota Unidentified Eukary… Unidentified Eukary… Unidentified Eukaryota
#> 21 dropped not dropped dropped dropped
更新:
下面有一些不错的答案.我接受了得票最多的那个,但事实证明,在通过 microbenchmark
运行所有建议后,purrr
解决方案的速度快了几乎一个数量级.
Some good answers below. I've accepted the one with the most upvotes, but it turns out that after running all the suggestions through microbenchmark
, the purrr
solution is the fastest by almost an order of magnitude.
推荐答案
这是另一种方法,将 rowwise()
与 across()
结合使用.
Here's another approach, using rowwise()
in combination with across()
.
- 我们使用
rowwise
因为它有助于通过cur_data()
将行用作单个向量 across(everything(), ~)
帮助我们一次改变所有列max.col(cur_data() != 'dropped', ties.method = 'last')
将检索值!= 'dropped' 的最后一列索引
- 我们将它的列名存储在一个临时变量中,比如
x
- 最后,我们使用基础 R 中的
if()..else
只改变那些值被dropped
的列
- We are using
rowwise
because it helps in using a row as a single vector throughcur_data()
across(everything(), ~)
helps us in mutating all columns at oncemax.col(cur_data() != 'dropped', ties.method = 'last')
will retrieve last column index where the value!= 'dropped'
- we store its column name in a temp variable say
x
- lastly we use
if()..else
from base R to mutate only those columns where value isdropped
希望答案足够清楚
library(tidyverse)
otu_table %>% rowwise() %>%
mutate(across(everything(), ~ {x<- names(cur_data())[max.col(cur_data() != 'dropped', ties.method = 'last')];
if (. == 'dropped') paste0('unidentified ', get(x)) else . }))
#> # A tibble: 21 x 4
#> # Rowwise:
#> domain class order species
#> <chr> <chr> <chr> <chr>
#> 1 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 2 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 3 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 4 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 5 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 6 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 7 Eukaryota Hexanauplia Calanoida unidentified Calanoida
#> 8 Eukaryota unidentified Eukaryo~ unidentified Eukaryo~ unidentified Eukaryota
#> 9 Eukaryota Dinophyceae Syndiniales unidentified Syndinial~
#> 10 Animals Polychaeta Terebellida unidentified Terebelli~
#> # ... with 11 more rows
由 reprex 包 (v2.0.0) 于 2021 年 6 月 19 日创建上>
Created on 2021-06-19 by the reprex package (v2.0.0)
这篇关于R dplyr 替换“缺失"具有第一个非“缺失"的列数据价值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!