Distinct in dplyr does not work (sometimes)

Problem description

I have the following data frame, which I obtained from a count. I used dput to make the data frame reproducible and then edited the output so that the level A appears twice.

df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"), 
                                         class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)), 
              class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))

print(df)

# A tibble: 4 x 2
  Procedure     n
  <fct>     <int>
1 D         10717
2 A          4412
3 A          2058
4 C          1480

Now I would like to apply distinct to Procedure and keep only the first A row.

library(dplyr)

df %>%
  distinct(Procedure, .keep_all = TRUE)

# A tibble: 4 x 2
  Procedure     n
  <fct>     <int>
1 D         10717
2 A          4412
3 A          2058
4 C          1480

It does not work: both A rows are still there. Strange...

Recommended answer

If we print the Procedure column, we can see that the level A is duplicated in the factor's levels, which is problematic for the distinct function: the two A rows point to different level positions, so they are not treated as the same value.

df$Procedure
[1] D A A C
Levels: A A C D -1
Warning message:
In print.factor(x) : duplicated level [2] in factor
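
The same check can be done without relying on the print warning. The following is only a minimal base R sketch, assuming the factor is built exactly as in the question; the variable names f, lv, dup_levels and has_dup are made up for illustration.

```r
# Rebuild the malformed factor from the question (level "A" appears twice).
f <- structure(c(4L, 1L, 2L, 3L),
               .Label = c("A", "A", "C", "D", "-1"),
               class = "factor")

# Inspect the level labels directly instead of printing the factor.
lv <- levels(f)

# Labels that occur more than once among the levels.
dup_levels <- unique(lv[duplicated(lv)])

# TRUE when at least one level label is repeated.
has_dup <- anyDuplicated(lv) > 0
```

Here dup_levels should contain only "A", and has_dup should be TRUE, which confirms that the factor itself is malformed rather than distinct misbehaving.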

One way to fix this is to rebuild the factor so that the duplicated (and unused) levels are dropped. We can use the factor function to achieve this. Another way is to convert the Procedure column to character.

df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"), 
                                           class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)), 
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))


library(tidyverse)

df %>% 
  mutate(Procedure = factor(Procedure)) %>%
  distinct(Procedure, .keep_all=TRUE)
# # A tibble: 3 x 2
#   Procedure     n
#   <fct>     <int>
# 1 D         10717
# 2 A          4412
# 3 C          1480
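
The character conversion mentioned above is not shown in the original answer. As a hedged sketch of that alternative, assuming the same df as in the question (the name result is just for illustration):

```r
library(dplyr)

# Same data as in the question: the factor carries the level "A" twice.
df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L),
                                           .Label = c("A", "A", "C", "D", "-1"),
                                           class = "factor"),
                     n = c(10717L, 4412L, 2058L, 1480L)),
                class = c("tbl_df", "tbl", "data.frame"),
                row.names = c(NA, -4L), .Names = c("Procedure", "n"))

# Converting to character compares the labels themselves,
# so both "A" rows become identical values before distinct() runs.
result <- df %>%
  mutate(Procedure = as.character(Procedure)) %>%
  distinct(Procedure, .keep_all = TRUE)
```

With this approach the column ends up as character rather than factor, so it is mainly suitable when the factor levels are not needed later on.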
