Using dplyr to create new dataframe depending on thresholds

Views: 42 Posted: 2021/5/2 20:46:30 Tags: r dataframe dplyr

Problem description

   Groups Names COL1  COL2  COL3        COL4
1      G1   SP1    1 0.400 0.500   Sequence1
2      G1   SP1    1 0.004 0.005   Sequence2
3      G1   SP1    0 0.004 0.005   Sequence3
4      G1   SP2    0 0.400 0.005 Sequence123
5      G1   SP2    0 0.004 0.500  Sequence14
6      G1   SP3    0 0.005 0.006  Sequence15
7      G1   SP5    1 0.400 0.006  Sequence16
8      G1   SP6    1 0.008 0.002  Sequence20
10     G2   Sp1    0 0.004 0.005  Sequence17
11     G2   SP1    0 0.050 0.600  Sequence18
12     G2   SP1    0 0.400 0.600   Sequence3
13     G2   SP2    0 0.004 0.005  Sequence22
14     G2   SP2    0 0.004 0.005  Sequence23
15     G2   SP5    0 0.004 0.005  Sequence16
16     G2   SP6    0 0.003 0.002  Sequence21
17     G2   SP7    0 0.560 0.760  Sequence67

Here is the dput:

dput(test_df)
structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"), 
    Names = structure(c(2L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, 1L, 2L, 
    2L, 3L, 3L, 5L, 6L, 7L), .Label = c("Sp1", "SP1", "SP2", 
    "SP3", "SP5", "SP6", "SP7"), class = "factor"), COL1 = c(1L, 
    1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
    ), COL2 = c(0.4, 0.004, 0.004, 0.4, 0.004, 0.005, 0.4, 0.008, 
    0.004, 0.05, 0.4, 0.004, 0.004, 0.004, 0.003, 0.56), COL3 = c(0.5, 
    0.005, 0.005, 0.005, 0.5, 0.006, 0.006, 0.002, 0.005, 0.6, 
    0.6, 0.005, 0.005, 0.005, 0.002, 0.76), COL4 = structure(c(1L, 
    8L, 13L, 2L, 3L, 4L, 5L, 9L, 6L, 7L, 13L, 11L, 12L, 5L, 10L, 
    14L), .Label = c("Sequence1", "Sequence123", "Sequence14", 
    "Sequence15", "Sequence16", "Sequence17", "Sequence18", "Sequence2", 
    "Sequence20", "Sequence21", "Sequence22", "Sequence23", "Sequence3", 
    "Sequence67"), class = "factor")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "10", "11", "12", "13", "14", 
"15", "16", "17"))
And from this dataframe I want to get another dataframe such as:

    G1  G2
SP1 A   B
SP2 x   x
SP3 x   NA
SP4 NA  NA
SP5 A   X
SP6 a   x
SP7 NA  b

The idea is, for each Groups value, to list the Names that are present and fill each cell with the letter A, B, X or NA. Whether the letter is upper or lower case depends on whether an identical COL4 value is found for the same Name in another Groups.

A is put when any row for the Name has COL1 > 0 AND at least one of its COL4 values also appears for the same Name in a different Groups
a is put when any row for the Name has COL1 > 0 AND none of its COL4 values appears for the same Name in a different Groups
B is put when any row for the Name has COL1 = 0 AND both COL2 and COL3 > 0.05 AND at least one of its COL4 values also appears for the same Name in a different Groups
b is put when any row for the Name has COL1 = 0 AND both COL2 and COL3 > 0.05 AND none of its COL4 values appears for the same Name in a different Groups
X is put when all rows for the Name have COL1 = 0 AND COL2 OR COL3 > 0.05 AND at least one of its COL4 values also appears for the same Name in a different Groups
x is put when all rows for the Name have COL1 = 0 AND COL2 OR COL3 > 0.05 AND none of its COL4 values appears for the same Name in a different Groups
NA is put when the Name is not present in the Group
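
One possible reading of the rules above, as a minimal sketch rather than a definitive answer. It assumes that "Sp1" is a typo for "SP1" (hence `toupper()`), that A takes priority over B, and that X covers every remaining case. The names `shared_row`, `letter`, `shared`, `code` and `result` are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data (same values as the dput output above).
test_df <- data.frame(
  Groups = rep(c("G1", "G2"), each = 8),
  Names  = c("SP1", "SP1", "SP1", "SP2", "SP2", "SP3", "SP5", "SP6",
             "Sp1", "SP1", "SP1", "SP2", "SP2", "SP5", "SP6", "SP7"),
  COL1   = c(1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  COL2   = c(0.4, 0.004, 0.004, 0.4, 0.004, 0.005, 0.4, 0.008,
             0.004, 0.05, 0.4, 0.004, 0.004, 0.004, 0.003, 0.56),
  COL3   = c(0.5, 0.005, 0.005, 0.005, 0.5, 0.006, 0.006, 0.002,
             0.005, 0.6, 0.6, 0.005, 0.005, 0.005, 0.002, 0.76),
  COL4   = paste0("Sequence", c(1, 2, 3, 123, 14, 15, 16, 20,
                                17, 18, 3, 22, 23, 16, 21, 67))
)

result <- test_df %>%
  # Assumption: "Sp1" and "SP1" are the same Name.
  mutate(Names = toupper(Names)) %>%
  # A row is "shared" when its COL4 value occurs for the same Name
  # in more than one Groups value.
  group_by(Names, COL4) %>%
  mutate(shared_row = n_distinct(Groups) > 1) %>%
  # One letter per Groups/Names pair, checked in the order A, B, X.
  group_by(Groups, Names) %>%
  summarise(
    letter = case_when(
      any(COL1 > 0)                             ~ "A",
      any(COL1 == 0 & COL2 > 0.05 & COL3 > 0.05) ~ "B",
      TRUE                                      ~ "X"
    ),
    shared = any(shared_row),
    .groups = "drop"
  ) %>%
  # Uppercase when a COL4 value is shared, lowercase otherwise.
  mutate(code = if_else(shared, letter, tolower(letter))) %>%
  select(Groups, Names, code) %>%
  # Missing Groups/Names combinations become NA automatically.
  pivot_wider(names_from = Groups, values_from = code) %>%
  arrange(Names)

print(result)
```

Under these assumptions the sketch reproduces the letters in the desired table for SP1, SP2, SP3, SP5, SP6 and SP7. There is no SP4 row because SP4 never occurs in the data.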

Let's take 4 examples:
1) For G1-SP1, row 1 has COL1 > 0, so it will get a letter A or a in the new dataframe. To know whether it will be an A or an a we have to look at COL4: Sequence3 (row 3) is also present in G2 for SP1 (row 12), so it will be an 'A'.
2) For G2-SP1, row 12 has both COL2 and COL3 > 0.05, so it will get a letter B or b in the new dataframe. It will be B because Sequence3 (row 12) is also present in G1 for SP1 (row 3).

3) For G2-SP2, no row has COL1 > 0 or both COL2 and COL3 > 0.05, so it will get a letter X or x in the new dataframe. It will be x because no SP2 in the other Groups has the same sequence (Sequence22 or Sequence23).

4) For G1-SP6, row 8 has COL1 > 0, so it will get a letter A or a in the new dataframe. It will be a because no SP6 in the other Groups has the same sequence (Sequence20).
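
The upper/lowercase check used in examples 1 and 4 can be sketched in base R as follows. `shares_col4` is a hypothetical helper, not something from the question, and `ex` holds only the rows needed for these two examples.

```r
# Hypothetical helper: TRUE when any COL4 value of `name` in group `grp`
# also occurs for the same name in a different group.
shares_col4 <- function(df, grp, name) {
  same_name <- df[df$Names == name, ]
  here  <- same_name$COL4[same_name$Groups == grp]
  other <- same_name$COL4[same_name$Groups != grp]
  length(intersect(here, other)) > 0
}

# Subset of the question's data: SP1 and SP6 rows relevant to the examples.
ex <- data.frame(
  Groups = c("G1", "G1", "G1", "G2", "G1", "G2"),
  Names  = c("SP1", "SP1", "SP1", "SP1", "SP6", "SP6"),
  COL4   = c("Sequence1", "Sequence2", "Sequence3", "Sequence3",
             "Sequence20", "Sequence21")
)

shares_col4(ex, "G1", "SP1")  # Sequence3 occurs in both groups: uppercase
shares_col4(ex, "G1", "SP6")  # Sequence20 vs Sequence21: lowercase
```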

So far I have tried:

library(dplyr)
library(tidyr)
library(stringr)

Env_table <- as.data.frame(test_df) %>%
  group_by(Groups, Names) %>%
  # COL1 is an integer column, so convert it to character before storing letters
  mutate(Env_variable = replace_na(as.character(COL1), "."),
         Env_variable = ifelse(any(COL1 >= 1), "A", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 == 0) && all(COL2 > 0.05) && all(COL3 > 0.05), "B", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 == 0) && all(COL2 < 0.05) && all(COL3 < 0.05), "X", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 == 0) && all(COL2 < 0.05) && all(COL3 > 0.05), "X", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 == 0) && all(COL2 > 0.05) && all(COL3 < 0.05), "X", Env_variable)) %>%
  mutate(Env_variable = ifelse(all(COL1 == 0) && all(!is.na(COL1)) && all(COL2 > 0.05) && all(COL3 > 0.05), "*", Env_variable)) %>%
  # keep one row per Groups/Names pair, then spread the groups into columns
  slice(1) %>%
  pivot_wider(id_cols = Names, names_from = Groups, values_from = Env_variable) %>%
  arrange(as.integer(str_extract(Names, "\\d+")))

where Env_variable is just an empty column that will store the A, B, X or NA values.

Thanks for your help.
