Using dplyr to create new dataframe depending on thresholds


Question

   Groups Names COL1  COL2  COL3        COL4
1      G1   SP1    1 0.400 0.500   Sequence1
2      G1   SP1    1 0.004 0.005   Sequence2
3      G1   SP1    0 0.004 0.005   Sequence3
4      G1   SP2    0 0.400 0.005 Sequence123
5      G1   SP2    0 0.004 0.500  Sequence14
6      G1   SP3    0 0.005 0.006  Sequence15
7      G1   SP5    1 0.400 0.006  Sequence16
8      G1   SP6    1 0.008 0.002  Sequence20
10     G2   Sp1    0 0.004 0.005  Sequence17
11     G2   SP1    0 0.050 0.600  Sequence18
12     G2   SP1    0 0.400 0.600   Sequence3
13     G2   SP2    0 0.004 0.005  Sequence22
14     G2   SP2    0 0.004 0.005  Sequence23
15     G2   SP5    0 0.004 0.005  Sequence16
16     G2   SP6    0 0.003 0.002  Sequence21
17     G2   SP7    0 0.560 0.760  Sequence67

Here is the dput:

dput(test_df)
structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"), 
    Names = structure(c(2L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, 1L, 2L, 
    2L, 3L, 3L, 5L, 6L, 7L), .Label = c("Sp1", "SP1", "SP2", 
    "SP3", "SP5", "SP6", "SP7"), class = "factor"), COL1 = c(1L, 
    1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
    ), COL2 = c(0.4, 0.004, 0.004, 0.4, 0.004, 0.005, 0.4, 0.008, 
    0.004, 0.05, 0.4, 0.004, 0.004, 0.004, 0.003, 0.56), COL3 = c(0.5, 
    0.005, 0.005, 0.005, 0.5, 0.006, 0.006, 0.002, 0.005, 0.6, 
    0.6, 0.005, 0.005, 0.005, 0.002, 0.76), COL4 = structure(c(1L, 
    8L, 13L, 2L, 3L, 4L, 5L, 9L, 6L, 7L, 13L, 11L, 12L, 5L, 10L, 
    14L), .Label = c("Sequence1", "Sequence123", "Sequence14", 
    "Sequence15", "Sequence16", "Sequence17", "Sequence18", "Sequence2", 
    "Sequence20", "Sequence21", "Sequence22", "Sequence23", "Sequence3", 
    "Sequence67"), class = "factor")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "10", "11", "12", "13", "14", 
"15", "16", "17"))
And from this dataframe I want to get another dataframe such as:

    G1  G2
SP1 A   B
SP2 x   x
SP3 x   NA
SP4 NA  NA
SP5 A   X
SP6 a   x
SP7 NA  b

The idea is to list, for each Groups value, the Names that are present, and to put a letter A, B, X or NA in each cell. Whether the letter is upper or lower case depends on whether at least one COL4 value of that Name is also found for the same Name in another Groups value.

  • A is put when any row for the Name has COL1 > 0 AND at least one COL4 value of that Name is also found for the same Name in a different Groups
  • a is put when any row for the Name has COL1 > 0 AND no COL4 value of that Name is found for the same Name in a different Groups
  • B is put when any row for the Name has COL1 = 0 AND COL2 AND COL3 > 0.05 AND at least one COL4 value of that Name is also found for the same Name in a different Groups
  • b is put when any row for the Name has COL1 = 0 AND COL2 AND COL3 > 0.05 AND no COL4 value of that Name is found for the same Name in a different Groups
  • X is put when all rows for the Name have COL1 = 0 AND COL2 OR COL3 > 0.05 AND at least one COL4 value of that Name is also found for the same Name in a different Groups
  • x is put when all rows for the Name have COL1 = 0 AND COL2 OR COL3 > 0.05 AND no COL4 value of that Name is found for the same Name in a different Groups
  • NA is put when the Name is not present in the Group
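The upper/lower-case part of these rules can be checked on its own. The following is a minimal sketch, assuming that "same COL4 content as the same Name in a different Groups" means that a Names/COL4 pair occurs in more than one Groups value. The `toy` data frame and the names `in_other_group`, `shared` and `case_flag` are made up for illustration; only the column names come from the question.

```r
library(dplyr)

# Hypothetical toy data that reuses the question's column names
toy <- data.frame(
  Groups = c("G1", "G1", "G2", "G2"),
  Names  = c("SP1", "SP6", "SP1", "SP6"),
  COL4   = c("Sequence3", "Sequence20", "Sequence3", "Sequence21")
)

case_flag <- toy %>%
  # A Names/COL4 pair seen in more than one group counts as shared
  group_by(Names, COL4) %>%
  mutate(in_other_group = n_distinct(Groups) > 1) %>%
  # One flag per Groups/Names cell: TRUE means upper case, FALSE lower case
  group_by(Groups, Names) %>%
  summarise(shared = any(in_other_group), .groups = "drop")

print(case_flag)
```

Here SP1 shares Sequence3 between G1 and G2, so both SP1 cells would get an upper-case letter, while SP6 has a different sequence in each group and would get a lower-case one.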

Let's take 4 examples:

1) For G1-SP1, row 1 has COL1 > 0, so it will get a letter A or a in the new dataframe. To know whether it is A or a, we have to look at COL4: Sequence3 is also present for SP1 in G2 (row 12), so it will be 'A'.

2) For G2-SP1, row 12 has COL2 and COL3 both > 0.05, so it will get a letter B or b in the new dataframe. It will be B because Sequence3 is also present for SP1 in G1 (row 3).

3) For G2-SP2, no row has COL1 > 0, and no row has COL2 and COL3 both > 0.05, so it will get a letter X or x in the new dataframe. It will be x because no SP2 in another Groups has the same Sequence (Sequence22 or Sequence23).

4) For G1-SP6, row 8 has COL1 > 0, so it will get a letter A or a in the new dataframe. It will be a because no SP6 in another Groups has the same Sequence (Sequence20).

For that I tried:

Env_table<-as.data.frame(test_df) %>%
  group_by(Groups,Names) %>%
  mutate(Env_variable = replace_na(COL1, "."),
         Env_variable = ifelse(any(COL1 >=1) , "A", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0 ) && all(COL2 >0.05) && all(COL3 >0.05) , "B", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0 ) && all(COL2 <0.05) && all(COL3 <0.05) , "X", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0 ) && all(COL2 <0.05) && all(COL3 >0.05) , "X", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0 ) && all(COL2 >0.05) && all(COL3 <0.05) , "X", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 ==0) && all(!is.na(COL1)) && all(COL2 >0.05) && all(COL3 >0.05) , "*", Env_variable))%>%
  slice(1) %>%
  pivot_wider(id_col = Names, names_from = Groups, values_from = Env_variable) %>%
  arrange(as.integer(str_extract(Names, "\\d+")))

where Env_variable is just an empty column that will store the A, B, X or NA values.

Thanks for your help.

Recommended answer

Your question is not crystal-clear, but here is an attempt to answer:

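library(dplyr)   # group_by(), summarise(), case_when()
library(tidyr)   # pivot_wider()

test_df %>% 
  group_by(Groups, Names) %>% 
  # one letter per Groups/Names pair; the first matching condition wins
  summarise(
    x=case_when(
      any(COL1>=1, na.rm=TRUE) ~ "A",
      any(COL1==0 & (COL2>0.05 & COL3>0.05), na.rm=TRUE) ~ "B",
      any(COL1==0 & (COL2<0.05 | COL3<0.05), na.rm=TRUE) ~ "X",
      TRUE ~ NA_character_
    )
  ) %>% 
  # one row per Name, one column per group
  pivot_wider(names_from = Groups, values_from = x)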

This gives the following output:

  Names G1    G2   
  <fct> <chr> <chr>
1 SP1   A     B    
2 SP2   X     X    
3 SP3   X     NA   
4 SP5   A     X    
5 SP6   A     X    
6 Sp1   NA    X    
7 SP7   NA    B
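This output does not yet distinguish upper from lower case. One possible extension is sketched below. It is not part of the original answer: `classify_names` is a hypothetical helper name, the `toy` rows are a small subset of the question's data, and two readings of the question are assumed. First, a cell counts as shared when one of its Names/COL4 pairs also occurs in another Groups value. Second, every cell that is neither A nor B becomes X, as the question's third example suggests.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: letters as in the answer, case from shared COL4 values
classify_names <- function(df) {
  df %>%
    # flag rows whose Names/COL4 pair occurs in more than one group
    group_by(Names, COL4) %>%
    mutate(shared_row = n_distinct(Groups) > 1) %>%
    group_by(Groups, Names) %>%
    summarise(
      letter = case_when(
        any(COL1 > 0, na.rm = TRUE) ~ "A",
        any(COL1 == 0 & COL2 > 0.05 & COL3 > 0.05, na.rm = TRUE) ~ "B",
        TRUE ~ "X"  # assumption: everything else is X
      ),
      shared = any(shared_row),
      .groups = "drop"
    ) %>%
    # lower case when nothing is shared with another group
    mutate(letter = if_else(shared, letter, tolower(letter))) %>%
    select(-shared) %>%
    # Names missing from a group become NA
    pivot_wider(names_from = Groups, values_from = letter)
}

# Subset of the question's rows (rows 1, 3, 8, 12, 16 and 17)
toy <- data.frame(
  Groups = c("G1", "G1", "G1", "G2", "G2", "G2"),
  Names  = c("SP1", "SP1", "SP6", "SP1", "SP6", "SP7"),
  COL1   = c(1, 0, 1, 0, 0, 0),
  COL2   = c(0.4, 0.004, 0.008, 0.4, 0.003, 0.56),
  COL3   = c(0.5, 0.005, 0.002, 0.6, 0.002, 0.76),
  COL4   = c("Sequence1", "Sequence3", "Sequence20",
             "Sequence3", "Sequence21", "Sequence67")
)

result <- classify_names(toy)
print(result)
```

On this subset the cells come out as A/B for SP1, a/x for SP6 and NA/b for SP7, which matches the corresponding rows of the table requested in the question. Names that differ only in case, such as Sp1 and SP1, are still treated as different names.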
