Removing Only Adjacent Duplicates in a Data Frame in R

Question

I have a data frame in R that is supposed to contain duplicates. However, some of those duplicates need to be removed. In particular, I only want to remove row-adjacent duplicates and keep the rest. For example, suppose I had the data frame:

df = data.frame(x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"), 
                y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

This produces the following data frame:

x   y
A   1
B   2
C   3
A   4
B   5
C   6
A   7
B   8
B   9
C   10

In this case, I expect the values to repeat as "A, B, C, A, B, C, etc.". It is only a problem when duplicates appear in adjacent rows. In the example above, that would be rows 8 and 9, where the duplicate "B" values are adjacent to each other.

In my data set, whenever this occurs, the first instance is always a user error and the second is always the correct version. In very rare cases, a duplicate might occur 3 (or more) times in a row. In every case, I always want to keep the last occurrence. Thus, following the example above, I would like the final data set to look like:

A   1
B   2
C   3
A   4
B   5
C   6
A   7
B   9
C   10

Is there an easy way to do this in R? Thank you in advance for your help!

Edit (11/19/2014 12:14 PM EST): There was a solution posted by user Akron (spelling?) that has since been deleted. I am not sure why, because it seemed to work for me.

The solution was:

df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]

It seems to work for me, so why was it deleted? For example, in a case with more than 2 consecutive duplicates:

df = data.frame(x = c("A", "B", "B", "B", "C", "C", "C", "A", "B", "C", "A", "B", "B", "C"), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   x  y
1  A  1
2  B  2
3  B  3
4  B  4
5  C  5
6  C  6
7  C  7
8  A  8
9  B  9
10 C 10
11 A 11
12 B 12
13 B 13
14 C 14

> df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
> df
   x  y
1  A  1
4  B  4
7  C  7
8  A  8
9  B  9
10 C 10
11 A 11
13 B 13
14 C 14

This seems to work?
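
One way to double-check the result above is with run-length encoding. The sketch below is not part of the original post: it uses base R's `rle()` to confirm that no adjacent duplicates remain and that the kept rows are the last row of each run. The names `res`, `runs_before`, `runs_after`, and `last_of_run` are illustrative only.

```r
# Rebuild the 14-row example from the question
df <- data.frame(x = c("A", "B", "B", "B", "C", "C", "C",
                       "A", "B", "C", "A", "B", "B", "C"),
                 y = 1:14)

# The one-liner under discussion
res <- df[with(df, c(x[-1] != x[-nrow(df)], TRUE)), ]

# Check 1: every run in the result has length 1 (no adjacent duplicates left)
runs_after <- rle(as.character(res$x))
all(runs_after$lengths == 1)
# TRUE

# Check 2: the kept rows are exactly the last row of each run in the input
runs_before <- rle(as.character(df$x))
last_of_run <- cumsum(runs_before$lengths)   # 1 4 7 8 9 10 11 13 14
identical(as.integer(rownames(res)), last_of_run)
# TRUE
```

The second check also hints at an alternative: `df[cumsum(rle(as.character(df$x))$lengths), ]` should select the same rows.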

Recommended answer

Try

 df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
#   x  y
#1  A  1
#2  B  2
#3  C  3
#4  A  4
#5  B  5
#6  C  6
#7  A  7
#9  B  9
#10 C 10

Explanation

Here, we compare each element with the one preceding it. This is done by removing the first element from the column and comparing the result with the same column with its last element removed, so that the two vectors have equal length.

 df$x[-1] # first element `A` removed
 #[1] B C A B C A B B C
 df$x[-nrow(df)] # last element `C` removed
 #[1] A B C A B C A B B

 df$x[-1]!=df$x[-nrow(df)]
 #[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

The resulting logical vector is one element shorter than `nrow(df)` because we removed one element from each side. To compensate, we append a `TRUE` at the end (the last row can never be followed by a duplicate, so it is always kept) and then use this vector as the row index to subset the data frame.
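
The steps above can be wrapped in a small reusable function. This is a minimal sketch, not part of the original answer: the name `drop_adjacent_dups` and its arguments are hypothetical, and it assumes the column contains no `NA` values (a comparison involving `NA` yields `NA`, which would produce `NA` rows when subsetting).

```r
# Hypothetical helper: keep only the last row of each run of adjacent
# duplicates in column `col`. Assumes `col` has no NA values.
drop_adjacent_dups <- function(data, col) {
  x <- as.character(data[[col]])
  n <- length(x)
  # TRUE where a row differs from the row after it;
  # the trailing TRUE always keeps the final row
  keep <- c(x[-1] != x[-n], TRUE)
  data[keep, , drop = FALSE]
}

# Example data from the question
df <- data.frame(x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"),
                 y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

res <- drop_adjacent_dups(df, "x")
res$y
# [1]  1  2  3  4  5  6  7  9 10
```

Placing the `TRUE` at the front instead, as in `c(TRUE, x[-1] != x[-n])`, would keep the first row of each run rather than the last, which is not what the question asks for but may be useful in other data sets.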
