Removing Only Adjacent Duplicates in a Data Frame in R

Question

I have a data frame in R that is expected to contain duplicates. However, some of those duplicates need to be removed. In particular, I only want to remove row-adjacent duplicates and keep the rest. For example, suppose I had the data frame:

df = data.frame(x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"), 
                y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

This results in the following data frame:

x   y
A   1
B   2
C   3
A   4
B   5
C   6
A   7
B   8
B   9
C   10

In this case, I expect the values to repeat as "A, B, C, A, B, C, etc.". It is only a problem when duplicates appear in adjacent rows. In my example above, that would be rows 8 and 9, where the duplicate "B" values sit next to each other.

In my data set, whenever this occurs, the first instance is always a user error and the second is always the correct version. In very rare cases, a value might be duplicated 3 (or more) times in a row. In every case, however, I always want to keep the last occurrence. Thus, following the example above, I would like the final data set to look like:

A   1
B   2
C   3
A   4
B   5
C   6
A   7
B   9
C   10

Is there an easy way to do this in R? Thank you in advance for your help!

Edit (11/19/2014 12:14 PM EST): There was a solution posted by user Akron (spelling?) that has since been deleted. I am not sure why, because it seemed to work for me.

The solution was:

df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]

It seems to work for me, so why did it get deleted? For example, in cases with more than 2 consecutive duplicates:

df = data.frame(x = c("A", "B", "B", "B", "C", "C", "C", "A", "B", "C", "A", "B", "B", "C"), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   x  y
1  A  1
2  B  2
3  B  3
4  B  4
5  C  5
6  C  6
7  C  7
8  A  8
9  B  9
10 C 10
11 A 11
12 B 12
13 B 13
14 C 14

> df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
> df
   x  y
1  A  1
4  B  4
7  C  7
8  A  8
9  B  9
10 C 10
11 A 11
13 B 13
14 C 14

This seems to work?

Recommended answer

Try

 df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
#   x  y
#1  A  1
#2  B  2
#3  C  3
#4  A  4
#5  B  5
#6  C  6
#7  A  7
#9  B  9
#10 C 10
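
For reuse, the one-liner above could be wrapped in a small function. The following is only a sketch: the name keep_last_of_runs and its col argument are hypothetical (not part of the original answer), and it assumes the column contains no NA values, since a comparison involving NA would produce NA in the row index.

```r
# Hypothetical helper (not from the original answer): keep only the last row
# of each run of consecutive equal values in column `col`.
# Assumption: df[[col]] contains no NA values.
keep_last_of_runs <- function(df, col) {
  x <- df[[col]]
  n <- length(x)
  if (n < 2) return(df)  # with 0 or 1 rows there is nothing to compare
  # TRUE where a row differs from the row after it; the last row is always kept
  keep <- c(x[-1] != x[-n], TRUE)
  df[keep, , drop = FALSE]
}

# The 14-row example from the question, with runs of three identical values
df <- data.frame(x = c("A", "B", "B", "B", "C", "C", "C",
                       "A", "B", "C", "A", "B", "B", "C"),
                 y = 1:14)
res <- keep_last_of_runs(df, "x")
res$y
# [1]  1  4  7  8  9 10 11 13 14
```

The drop = FALSE argument is a defensive choice so that a data frame with a single column is still returned as a data frame rather than a vector.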

Explanation

Here, we compare each element with the one next to it. This can be done by removing the first element from the column and comparing the result with the same column with its last element removed (so that the lengths are equal).

 df$x[-1] # first element removed
 #[1] B C A B C A B B C
 df$x[-nrow(df)] # last element `C` removed
 #[1] A B C A B C A B B

 df$x[-1]!=df$x[-nrow(df)]
 #[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

In the above, the length is 1 less than nrow(df) because we removed one element. To compensate, we append a TRUE to the end and then use the resulting logical vector to subset the data frame. Because the TRUE goes at the end, position i of the vector is TRUE when row i differs from row i + 1, so the last row of each run of duplicates is the one that is kept.
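
As a cross-check, the same row positions can be derived with base R's rle(), which reports the length of each run of equal values. This is an alternative sketch, not the method from the answer. It uses a plain character vector, since rle() does not accept factors; a factor column would need as.character() first.

```r
# The x column from the question's first example, as a character vector
x <- c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C")

# Run-length encoding: one entry per run of consecutive equal values
runs <- rle(x)

# The cumulative sum of the run lengths is the position of the last
# element of each run, i.e. the rows to keep
last_idx <- cumsum(runs$lengths)
last_idx
# [1]  1  2  3  4  5  6  7  9 10

# The logical index from the answer selects the same positions
keep <- c(x[-1] != x[-length(x)], TRUE)
all(which(keep) == last_idx)
# [1] TRUE
```

With a data frame df, the rows would then be selected with df[last_idx, ].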
