Find indices of duplicated rows


Question

The function duplicated in R searches for duplicate rows. To remove the duplicates, we only need to write df[!duplicated(df),] and they are dropped from the data frame.

But how can we find the indices of the duplicated data? If duplicated returns TRUE for some row, that row is a repeat of an earlier row in the data frame, and its index is easy to obtain. How do we obtain the index of the first occurrence of that row? In other words, the index of the row that the duplicated row is identical to?

I could loop over the data.frame, but I think there is a more elegant answer to this question.

Solution

This returns a logical index vector:

duplicated(df) | duplicated(df[nrow(df):1, ])[nrow(df):1]

Here's an example:

df <- data.frame(a = c(1,2,3,4,1,5,6,4,2,1))

duplicated(df) | duplicated(df[nrow(df):1, ])[nrow(df):1]
#[1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

which(duplicated(df) | duplicated(df[nrow(df):1, ])[nrow(df):1])
#[1]  1  2  4  5  8  9 10

Update (based on comment):
The complexity of the command can be reduced if fromLast = TRUE is used as function argument. This is easier than creating two reversed vectors.

duplicated(df) | duplicated(df, fromLast = TRUE)
#[1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
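
To see what each half of this expression contributes, the two calls can be inspected separately. This is a minimal sketch on the example data from above; the names fwd and bwd are introduced here for illustration only and are not part of the original answer.

```r
df <- data.frame(a = c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1))

# TRUE for every repeat of a row already seen (first occurrences stay FALSE)
fwd <- duplicated(df)

# TRUE for every row that appears again later (last occurrences stay FALSE)
bwd <- duplicated(df, fromLast = TRUE)

which(fwd)
#[1]  5  8  9 10

which(bwd)
#[1] 1 2 4 5

# A row belongs to a duplicate group if either scan flags it
which(fwd | bwd)
#[1]  1  2  4  5  8  9 10
```

Rows 3, 6 and 7 are flagged by neither scan because their values occur only once.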

How does it work?

The function duplicated is applied to both the original data frame and the data frame with its rows in reverse order. The output of the latter is reversed again. Note that the first occurrences of duplicated values in the original data are the last occurrences in the reversed version. Afterwards, both vectors are combined using | since a TRUE in at least one of them indicates a duplicated value. Passing fromLast = TRUE gives the same result without reversing the rows by hand.
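
The logical vector marks every row that belongs to a duplicate group, but it does not say which earlier row each repeat matches. One possible way to get that mapping is to build a string key per row and use match, which returns the position of the first match. This is a sketch rather than part of the original answer: the names key, first_idx and dup are hypothetical, and it assumes that the separator "\r" never occurs in the data and that converting the values to character keeps distinct rows distinct.

```r
df <- data.frame(a = c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1))

# One string key per row (assumption: "\r" does not appear in any value)
key <- do.call(paste, c(df, sep = "\r"))

# For each row, the index of the first row with the same key
first_idx <- match(key, key)
first_idx
#[1] 1 2 3 4 1 6 7 4 2 1

# Pair each repeated row with the first occurrence it duplicates
dup <- which(duplicated(df))
data.frame(row = dup, first = first_idx[dup])
```

Here rows 5 and 10 map back to row 1, row 8 to row 4, and row 9 to row 2. Rows that are not repeats map to themselves.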
