Find indices of duplicated rows
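Problem description
The function duplicated in R searches for duplicate rows. To remove the duplicates, we only need to write df[!duplicated(df), ] and the duplicates are removed from the data frame.
But how can we find the indices of the duplicated data? If duplicated returns TRUE for some row, that row is a second (or later) occurrence of a row in the data frame, and its index is easy to obtain. How can we obtain the index of the first occurrence of that row? In other words, the index of the row to which the duplicated row is identical?
I could loop over the data.frame, but I think there is a more elegant answer.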
This returns a logical index vector:
duplicated(df) | duplicated(df[nrow(df):1, ])[nrow(df):1]
Here's an example:
df <- data.frame(a = c(1,2,3,4,1,5,6,4,2,1))
duplicated(df) | duplicated(df[nrow(df):1, ])[nrow(df):1]
#[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
which(duplicated(df) | duplicated(df[nrow(df):1, ])[nrow(df):1])
#[1] 1 2 4 5 8 9 10
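The logical vector marks every row that takes part in a duplication, but it does not say which earlier row each duplicate matches. One possible way to recover that mapping is a sketch based on match(), which returns the position of the first match. The row-key construction via paste() and the "\r" separator are assumptions made for illustration, not part of the original answer; the approach relies on the separator never occurring in the data and on the character conversion being lossless for the values involved.

```r
df <- data.frame(a = c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1))

# Build one string key per row. The separator "\r" is an assumption:
# it must not appear in the data itself.
key <- do.call(paste, c(df, sep = "\r"))

# match() returns the position of the FIRST match, so first_idx[i] is the
# index of the first row that is identical to row i.
first_idx <- match(key, key)
first_idx
# 1 2 3 4 1 6 7 4 2 1

# For the rows flagged by duplicated(), look up the row they repeat.
dup_rows <- which(duplicated(df))   # rows 5, 8, 9, 10
first_idx[dup_rows]                 # rows 1, 4, 2, 1
```

Rows that are not duplicates simply map to themselves, so first_idx can be used for all rows at once.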
Update (based on comment):
The command can be simplified by passing fromLast = TRUE as a function argument. This is easier than creating two reversed vectors.
duplicated(df) | duplicated(df, fromLast = TRUE)
#[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
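The fromLast form should work the same way when rows have several columns. As a small sketch, with a hypothetical two-column data frame df2 invented only for illustration, the combined vector can be passed to which() to get the indices, or used directly to subset the rows:

```r
# Hypothetical data, for illustration only: rows 1 and 4 are identical.
df2 <- data.frame(a = c(1, 1, 2, 1),
                  b = c("x", "y", "x", "x"))

# TRUE for every row that has at least one identical twin elsewhere.
dup_all <- duplicated(df2) | duplicated(df2, fromLast = TRUE)

which(dup_all)    # indices of all rows involved in a duplication: 1 and 4
df2[dup_all, ]    # the rows themselves
```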
How does it work?
The function duplicated is applied to both the original data frame and the data frame with its rows in reversed order. The output of the latter is reversed again. Note that the first occurrences of duplicated values in the original data are the last occurrences in the reversed version. Afterwards, both vectors are combined using |, since a TRUE in at least one of them indicates a duplicated value.
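To make the explanation concrete, here is a step-by-step sketch on the example data. The intermediate names forward, backward and backward_manual are introduced here only for illustration.

```r
df <- data.frame(a = c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1))

# Scanning top to bottom flags every occurrence except the first one.
forward <- duplicated(df)
which(forward)             # rows 5, 8, 9, 10

# Scanning bottom to top flags every occurrence except the last one.
backward <- duplicated(df, fromLast = TRUE)
which(backward)            # rows 1, 2, 4, 5

# The manual variant: reverse the rows, run duplicated(), reverse the result.
backward_manual <- duplicated(df[nrow(df):1, ])[nrow(df):1]
all(backward == backward_manual)   # both approaches agree

# A row is part of a duplication if either scan flags it.
which(forward | backward)  # rows 1, 2, 4, 5, 8, 9, 10
```

Row 5 is flagged by both scans because the value 1 occurs three times, so its middle occurrence is neither the first nor the last.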