Remove duplicate tuples after sorting the tuple in R


Problem description




I have a question regarding removing duplicates after sorting within a tuple in R.

Let's say I have a set of values (built with cbind, so technically a matrix rather than a data frame)

df <- cbind(c(1, 2, 7, 8, 5, 1), c(5, 6, 3, 4, 1, 8), c(1.2, 1, -0.5, 5, 1.2, 1))

and I take the first two columns as a and b

a=df[,1]
b=df[,2]
temp<-cbind(a,b)

What I want is to deduplicate the rows based on the sorted tuple, so that (a, b) and (b, a) count as the same pair. For example, I want to keep a=1,2,7,8,1 and b=5,6,3,4,8, with the entries a[5] and b[5] removed. This is basically for determining interactions between two objects: 1 vs 5, 2 vs 6, etc. Since 5 vs 1 is the same as 1 vs 5, I want to remove it.

The route I started to take was as follows. I created a function that sorts the values within each row, and then put the results back into a matrix like this.

sortme<-function(i){sort(temp[i,])}
sorted<-t(sapply(1:nrow(temp),sortme))

and got the following results

     a b
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
[5,] 1 5
[6,] 1 8

I then call unique() on the sorted result

unique(sorted)

which gives

     a b
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
[5,] 1 8

I then also use !duplicated to get a logical (TRUE/FALSE) vector that I can use on my original dataset to pull out the matching values from the other column.

T_F<-!duplicated(sorted)
final_df<-df[T_F,]
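
For the two-column case, the same mask can probably be built without sorting row by row. The following is only a sketch under that assumption: it uses base R's pmin() and pmax() to put the smaller value of each pair first, and the names key and keep are illustrative, not part of the original code.

```r
# Same example data as above; cbind() returns a numeric matrix.
df <- cbind(c(1, 2, 7, 8, 5, 1), c(5, 6, 3, 4, 1, 8), c(1.2, 1, -0.5, 5, 1.2, 1))
a <- df[, 1]
b <- df[, 2]

# Canonical key for each unordered pair: smaller value first, larger second.
# pmin()/pmax() are element-wise, so no per-row sort() call is needed.
key <- cbind(lo = pmin(a, b), hi = pmax(a, b))

# TRUE for the first occurrence of each unordered pair (row-wise on a matrix).
keep <- !duplicated(key)

# drop = FALSE keeps the result a matrix even if only one row survives.
final_df <- df[keep, , drop = FALSE]
```

On the example data this drops only row 5 (the pair 5 vs 1), which matches the result of the sort-then-duplicated route. It does not generalise to tuples of more than two values, where a per-row sort is still needed.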

What I want to know is whether I'm going about this the right way for a very large dataset, or whether there is a built-in function that does this already.

Solution

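Depending on what you mean by "a very large dataset", you might gain some speed by applying the sorting function only to those rows whose sums are duplicated. Two rows can only contain the same pair if their row sums are equal, so a row with a unique sum cannot be a duplicate and does not need sorting.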

theSums <- .rowSums(temp, m = nrow(temp), n = ncol(temp))

almostSorted <- do.call(rbind, tapply(seq_len(nrow(temp)), theSums,
  function(x) {
    if(length(x) == 1L) {
      return(cbind(x, temp[x, , drop = FALSE]))
    } else {
      return(cbind(x, t(apply(temp[x, ], 1, sort))))
    }
  }
))

(sorted <- almostSorted[order(almostSorted[, 1]), -1])

[1,] 1 5
[2,] 2 6
[3,] 7 3
[4,] 8 4
[5,] 1 5
[6,] 1 8
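
The matrix above is only partially sorted (rows 3 and 4 are left as they were), so a deduplication step is still needed. Presumably that is the same !duplicated() mask used in the question. The sketch below assumes so and uses a simplified logical-mask variant of the sum filter rather than the tapply() code above. The names shared, partlySorted and keep are illustrative.

```r
# Example data from the question; temp holds the two columns that form the pair.
df <- cbind(c(1, 2, 7, 8, 5, 1), c(5, 6, 3, 4, 1, 8), c(1.2, 1, -0.5, 5, 1.2, 1))
temp <- df[, 1:2]

# Rows can only hold the same unordered pair if their sums are equal,
# so only rows whose sum occurs more than once need sorting.
theSums <- rowSums(temp)
shared <- duplicated(theSums) | duplicated(theSums, fromLast = TRUE)

# Sort just those rows in place; all other rows are left untouched.
partlySorted <- temp
if (any(shared)) {
  partlySorted[shared, ] <- t(apply(temp[shared, , drop = FALSE], 1, sort))
}

# Same mask as in the question, applied to the partially sorted matrix.
keep <- !duplicated(partlySorted)
final_df <- df[keep, , drop = FALSE]

# Sanity check against sorting every row, as the question does.
fullySorted <- t(apply(temp, 1, sort))
stopifnot(all(keep == !duplicated(fullySorted)))
```

On the example data only rows 1 and 5 share a sum (6), so only those two rows are sorted, and row 5 is then dropped as a duplicate of row 1. Whether this is faster than sorting every row depends on how many rows share a sum, so it is worth timing on the real data.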
