R - Identify and remove duplicate rows based on two columns
Question
I have some data that looks like this:
Course_ID Text_ID
33 17
33 17
58 17
5 22
8 22
42 25
42 25
17 26
17 26
35 39
51 39
Not having a background in programming, I'm finding it tricky to articulate my question, but here goes: I only want to keep rows where Course_ID varies but Text_ID is the same. So, for example, the final data would look something like this:
Course_ID Text_ID
5 22
8 22
35 39
51 39
As you can see, Text_ID 22 and 39 are the only ones that have different Course_ID values. I suspect subsetting the data would be the way to go, but as I said, I'm quite a novice at this kind of thing and would really appreciate any advice on how to approach this.
Recommended answer
Select the Text_ID groups in which no Course_ID value is repeated.
In dplyr, you can write this as:
library(dplyr)
df %>% group_by(Text_ID) %>% filter(n_distinct(Course_ID) == n()) %>% ungroup
# Course_ID Text_ID
# <int> <int>
#1 5 22
#2 8 22
#3 35 39
#4 51 39
And with data.table:
library(data.table)
setDT(df)[, .SD[uniqueN(Course_ID) == .N], Text_ID]
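If you'd rather not load a package, the same idea can be sketched in base R. This is only an illustration, not part of the original answer: it rebuilds the sample data from the question under the name df (the name the answer assumes), and no_repeats and result are names made up for this example.

```r
# Rebuild the sample data from the question; `df` is the name the answer assumes.
df <- data.frame(
  Course_ID = c(33, 33, 58, 5, 8, 42, 42, 17, 17, 35, 51),
  Text_ID   = c(17, 17, 17, 22, 22, 25, 25, 26, 26, 39, 39)
)

# For every row, record whether its Text_ID group contains no repeated Course_ID.
# ave() applies FUN to each group and puts the results back in the original row order.
# (`no_repeats` is a hypothetical helper name used only in this sketch.)
no_repeats <- as.logical(ave(
  df$Course_ID, df$Text_ID,
  FUN = function(x) rep(anyDuplicated(x) == 0, length(x))
))

# Keep only the rows from groups without repeats and renumber the rows.
result <- df[no_repeats, ]
rownames(result) <- NULL
print(result)
```

Note that, like the dplyr and data.table versions, this condition would also keep a Text_ID that appears in only one row, since a single row has no repeats. The sample data has no such group; if yours does and you want it dropped, adding a check such as length(x) > 1 inside FUN is one option.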