R enumerate duplicates in a dataframe with unique value


Question

I have a dataframe containing a set of parts and test results. The parts are tested at 3 sites (North, Centre and South). Sometimes those parts are re-tested. I eventually want to create some charts that compare the results from the first time a part was tested with the second (or third, etc.) time it was tested, e.g. to look at tester repeatability.

As an example, I've come up with the code below. I've explicitly removed the "Expt" (experiment) column from the morley data set, as this is the column I'm effectively trying to recreate. The code works; however, it seems that there must be a more elegant way to approach this problem. Any thoughts?

Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).

New example:

part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)

data<-data.frame(part,site,result)
data$index<-1
repeat {
    if(!anyDuplicated(data[,c("part","site","index")]))
    { break }
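    # Test the same columns as the exit check, otherwise a part/site pair
    # tested three or more times would never terminate the loop
    data$index<-ifelse(duplicated(data[,c("part","site","index")]),data$index+1,data$index)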
}
data

   part site result index
1     A    N     17     1
2     A    C     20     1
3     A    S     25     1
4     B    C     51     1
5     B    N     50     1
6     B    S     49     1
7     A    N     43     2
8     A    C     45     2
9     A    S     47     2
10    C    N     52     1
11    C    S     51     1
12    C    C     56     1

Old example:

#Generate a trial data frame from the morley dataset
df<-morley[,c(2,3)]

#Set up an iterative variable
#Create the index column and initialise to 1
df$index<-1

# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
    if(!anyDuplicated(df[,c(1,3)]))
    { break }
    df$index<-ifelse(duplicated(df[,c(1,3)]),df$index+1,df$index)
}

# Check - The below vector should all be true
df$index==morley$Expt

Recommended answer

We can use diff and cumsum on the 'Run' column to get the expected output. This method does not create a column of 1s (i.e. 'index'), but it does assume that the sequence in 'Run' is ordered as shown in the OP's example, so that each new experiment starts where the run number drops.

indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
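
To see why the diff/cumsum approach works, here is a minimal sketch on a made-up vector (the `run` values below are hypothetical, not taken from morley). It also shows that the result changes when the runs are not in increasing order within each experiment, which is the ordering assumption mentioned above.

```r
# Hypothetical 'Run' sequence: three experiments with three runs each
run <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)

# diff() is negative exactly where the run number drops back down
resets <- diff(run) < 0

# c(TRUE, ...) marks the first row as the start of experiment 1;
# cumsum() then counts how many starts have been seen so far
expt <- cumsum(c(TRUE, resets))
expt
#[1] 1 1 1 2 2 2 3 3 3

# If the runs inside an experiment are out of order, every drop is
# counted as a new experiment, so the grouping is no longer correct
run_unordered <- c(2, 1, 3, 1, 2, 3)
expt_unordered <- cumsum(c(TRUE, diff(run_unordered) < 0))
expt_unordered
#[1] 1 2 2 3 3 3
```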

Or we can use ave

indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE

Update

Using the new example

with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
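
As a rough sketch of how the ave result might be used, the code below stores the counter as a column and then pairs each first test with its second test. The `tests` data frame and the `paired` name are made up for illustration; part A at site N is tested three times to show that the counter goes beyond 2. The pairing step via merge is one possible way to prepare the first-versus-retest comparison described in the question, not part of the original answer.

```r
# Hypothetical data: part A at site N is tested three times
part   <- c("A", "A", "B", "A", "B", "A")
site   <- c("N", "C", "N", "N", "N", "N")
result <- c(17, 20, 50, 43, 48, 44)
tests  <- data.frame(part, site, result)

# Within each part/site group, number the rows 1, 2, 3, ... in order of appearance
tests$index <- with(tests, ave(seq_along(part), part, site, FUN = seq_along))
tests$index
#[1] 1 1 1 2 2 3

# Put first and second tests of the same part/site side by side
paired <- merge(subset(tests, index == 1), subset(tests, index == 2),
                by = c("part", "site"), suffixes = c(".first", ".second"))
paired[, c("part", "site", "result.first", "result.second")]
```

Only part/site pairs that were actually re-tested appear in `paired`, because merge keeps matching rows only by default.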

Or we can use getanID from library(splitstackshape)

library(splitstackshape)
getanID(data, c('part', 'site'))[]
