根据另一个不同大小的数据框替换数据框中的选定列的值 [英] Replacing values of selected columns in data frame based on another different size data frame

查看:39
本文介绍了根据另一个不同大小的数据框替换数据框中的选定列的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对按特定列值分组的数据帧的数字列执行了CV计算。然后根据结果数据框,在第一个数据框中替换了相关的列值。我通过混合dplyr函数,merge函数和带​​if语句的for循环来完成该操作。我想请您对tidyverse函数进行更短的操作(如果有)。这是步骤和结果示例。

I performed CV calculation over numeric columns of a data frame grouped by a specific column values. Then based on resulting dataframe, I replaced the related column values in first data frame. I did the operation by mixing dplyr functions, merge function, and for loop with if statement. I would like to ask you for a shorter operations of tidyverse functions if any. Here are the steps and example outcome.

我有一个数据框-A

Accession Column1 Column2 Column3 Root ID
2000_1    0       0.2     14      2000  1
2000_2    0.01    0.2     17      2000  2
2001_1    0.012   0.22    11      2001  1
2001_2    0.011   0.231   17      2001  2

对于此演示,我有三个数字列Column1,Column2和Column3,但在实际数据中为> 500.
我将数据框作为子集,以仅保留数字列(Column1,Column2,Column3和Root),而ID列除外。然后,我根据列 Root对行进行分组。并使用汇总函数/还使用小块函数groupby / summarize_if / ungroup将方差系数计算为第二个数据帧。
如果值大于30,我将其替换为NA。这是结果数据框-B

For this demo I have three numerical columns Column1, Column2, and Column3, but in actual data it is > 500. I subset the data frame to keep only numeric columns (Column1, Column2, Column3, and Root) except ID column. Then I grouped the rows based on column "Root" and calculated the coefficient of variance as a second data frame by using aggregate function/also using tibble functions groupby/summarize_if/ungroup. I replaced the values with NA if they are greater than 30. Here is the resulting dataframe- B

 Column1 Column2 Column3 Root
 NA      0       13.68   2000
 6.14    3.44     NA     2001 

我使用合并功能合并了数据帧A和B基于根列。
结果数据帧为

I merged data frame A and B by using merge function based on Root column. Resulting data frame was

Column1.x Column2.x Column3.x Accession Column1.y Column2.y Column3.y Root ID 
    NA       0        13.68     2000_1    0       0.2        14       2000  1
    NA       0        13.68     2000_2    0.01    0.2        17       2000  2
    6.14    3.44      NA        2001_1    0.012   0.22       11       2001  1
    6.14    3.44      NA        2001_2    0.011   0.231      17       2001  2

然后我遍历数字列(第1、2、3列)。对于根列值2000,我将Column1.y值(0和0.01)替换为Column1.x的值是NA。对于Root列值2001,我没有替换Column1.y值不是NA。

Then I looped through numeric columns (Column 1, 2, 3). For Root column value 2000, I replaced Column1.y values (0 and 0.01) with Column1.x values are NA. For Root column value 2001, I didn't replace Column1.y values are not NA.

结果数据框为

    Accession Column1.y Column2.y Column3.y Root ID 
    2000_1    NA      0.2        14       2000  1
    2000_2    NA      0.2        17       2000  2
    2001_1    0.012   0.22       NA       2001  1
    2001_2    0.011   0.231      NA       2001  2


推荐答案

数据:

dfa <- read.table(text="Accession Column1 Column2 Column3 Root ID
2000_1    0       0.2     14      2000  1
2000_2    0.01    0.2     17      2000  2
2001_1    0.012   0.22    11      2001  1
2001_2    0.011   0.231   17      2001  2", header = T)

库和功能:

library(tidyverse)

cv <- function(x) 100 * (sd(x) / mean(x))


解决方案:


如果我们紧追其后并考虑最终结果,基本上您想将 Column1:Column3 中的值替换为 NA 如果CV大于30。否则,您要保留原始值。下面的代码可以做到这一点。

Solution:

If we cut to the chase and consider the end result, basically you want to replace the values in Column1:Column3 with NA if CV is greater than 30. Otherwise, you want to preserve the original values. The code below does that.

dfa %>% 
  group_by(Root) %>% 
  mutate_at(vars(Column1:Column3),
            list(~ if(cv(.) > 30) NA else .)) 

结果:

#> # A tibble: 4 x 6
#>   Accession Column1 Column2 Column3  Root    ID
#>   <fct>       <dbl>   <dbl>   <dbl> <int> <int>
#> 1 2000_1     NA       0.2        14  2000     1
#> 2 2000_2     NA       0.2        17  2000     2
#> 3 2001_1      0.012   0.22       NA  2001     1
#> 4 2001_2      0.011   0.231      NA  2001     2






更复杂方法:


如果我们想遵循您的思路,那么我们将得到一个更复杂的代码,如下所示;



More complicated approaches:

If we want to follow your train of thoughts, then we'll end up with a more complicated code which is illustrated below;

dfa %>% 
  select_if(function(col) is.numeric(col) & all(col != .$ID)) %>% 
  group_by(Root) %>% 
  summarise_each(list(cv)) %>% 
  mutate_at(vars(Column1:Column3),
            list(~ ifelse(. > 30, NA, 0))) %>% 
  left_join(dfa[,c("Root", "ID")], . , by = "Root") %>% 
  bind_rows(dfa, .) %>% 
  group_by(Root, ID) %>% 
  summarise_each(list(~ if(is.numeric(.)) sum(., na.rm = FALSE) else first(.))) %>% 
  ungroup %>% 
  select(-ID, -Root, everything())

说明:


  1. 选择数字列除了 ID

  2. Root 分组。

  3. 计算所有列的CV。

  4. 使用 NA 替换大于30的CV值,其余部分替换为0。我计划将这些值与原始值相加,因为OP对保存NA感兴趣(即大于30),但其他值在原始数据集中保持不变。因此,与0相加会使后者保持不变,而那些NA( na.rm = F )将影响值。

  5. 添加ID列通过合并使CV矩阵与原始数据集大小(行方向)相同来返回。此外,它将在以后用于分组。

  6. 按行绑定数据集。

  7. Root ID

  8. 汇总数字列(即 Column1 Column2 等),将原始数据框和修改后的CV矩阵中的值相加,并保留其他列中的第一个值(因为原始数据框在<$ c $中是第一个) c> bind_rows 表示保留原始值)。

  9. 取消分组以避免将来发生冲突。


  1. selecting numeric columns except ID.
  2. grouping by Root.
  3. Calculating CV for all the columns.
  4. Replacing CV values greater than 30 with NA and the rest with 0. I am planing to sum these with the original values as it seems that OP is interested in preserving the NAs (i.e. greater than 30) from this CV matrix but keep the other values unchanged in the original dataset. So summing with 0 keep the latter unchanged while those NAs (na.rm = F) will affect the values.
  5. Adding ID column back by joining to make the CV matrix the same size (rowwise) of the original dataset. Moreover, it will be used for grouping later.
  6. Binding the datasets by rows.
  7. grouping by Root and ID.
  8. Summarizing numeric columns (i.e. Column1, Column2, etc.) by summing the values from original dataframe and modified CV matrix and keeping the first value from other columns (since the original dataframe was first in bind_rows that means preserving the original values).
  9. Ungrouping to avoid future conflicts.
  10. Rearranging columns in the order that OP presented.




另一种解决方案 是与上面的内容非常相似,但是除了加入以获取ID列并扩展CV矩阵外,还可以从一开始就通过将其汇总为列表列并随后取消嵌套来保留它们。


Another solution would be very similar to what's above, but instead of joining to get the ID column and expand the CV matrix, one can preserve them from the beginning by summarizing as list column and later unnesting them.

dfa %>% 
  mutate(ID = as.factor(ID)) %>% 
  group_by(Root) %>% 
  summarise_each(list(~ if(is.numeric(.)) cv(.) else list(.))) %>% 
  mutate_at(vars(Column1:Column3),
            list(~ ifelse(. > 30, NA, 0))) %>% 
  unnest(cols = c(Accession, ID)) %>% 
  mutate(ID = as.integer(ID)) %>% 
  bind_rows(dfa, .) %>% 
  group_by(Root, ID) %>% 
  summarise_each(list(~ if(is.numeric(.)) sum(., na.rm = FALSE) else first(.))) %>% 
  ungroup %>% 
  select(-ID, -Root, everything())

这篇关于根据另一个不同大小的数据框替换数据框中的选定列的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆