Find columns with same data in one data.frame


Problem description

I have a data.frame named A with 5000 columns. How can I find the columns in this data.frame that are equal to each other?

Solution

As @John mentioned, there are problems with using duplicated. I would add that transposing the data.frame forces all the data into the same data type before it is even compared with duplicated. As an example, here is a data.frame:

df <- data.frame( a = LETTERS[1:3],
                  b = 1:3,
                  c = as.character(1:3),
                  d = LETTERS[1:3],
                  e = 1:3,
                  f = 1:3)
df
#   a b c d e f
# 1 A 1 1 A 1 1
# 2 B 2 2 B 2 2
# 3 C 3 3 C 3 3

Note that column c is very similar to columns b, e, and f, but not identical because of the different types (character versus numeric). The solution suggested by @Jubbles would disregard these differences.
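To make the coercion concrete, here is a minimal sketch on the same example data. It sets stringsAsFactors = FALSE explicitly (the default since R 4.0.0) so the letter columns are plain character vectors; the names m and dup are only illustrative and do not come from the answer.

```r
# Same example data as above; stringsAsFactors = FALSE is set explicitly
# so the result does not depend on the R version.
df <- data.frame(a = LETTERS[1:3],
                 b = 1:3,
                 c = as.character(1:3),
                 d = LETTERS[1:3],
                 e = 1:3,
                 f = 1:3,
                 stringsAsFactors = FALSE)

# t() goes through as.matrix(), and a matrix can hold only one type,
# so every value ends up as a character string.
m <- t(df)
typeof(m)

# duplicated() on a matrix compares rows, i.e. the original columns.
dup <- duplicated(m)
as.vector(dup)
# Column c is flagged as a duplicate of b although their types differ.
```

After the transpose, the integer column b and the character column c can no longer be told apart, which is the difference the identical-based approach preserves.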

Instead, it seems more appropriate to use the identical function on the columns of your data.frame. You can compare columns two-by-two using outer:

are.cols.identical <- function(col1, col2) identical(df[,col1], df[,col2])
identical.mat      <- outer(colnames(df), colnames(df),
                            FUN = Vectorize(are.cols.identical))
identical.mat
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
# [1,]  TRUE FALSE FALSE  TRUE FALSE FALSE
# [2,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
# [3,] FALSE FALSE  TRUE FALSE FALSE FALSE
# [4,]  TRUE FALSE FALSE  TRUE FALSE FALSE
# [5,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
# [6,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
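The matrix is symmetric and its diagonal is always TRUE, so each pair of identical columns appears twice. As a sketch, the distinct pairs can be read off the upper triangle; pairs.idx and pairs are illustrative names, and df[[col]] is used here instead of df[, col] only as a matter of taste.

```r
df <- data.frame(a = LETTERS[1:3],
                 b = 1:3,
                 c = as.character(1:3),
                 d = LETTERS[1:3],
                 e = 1:3,
                 f = 1:3,
                 stringsAsFactors = FALSE)

are.cols.identical <- function(col1, col2) identical(df[[col1]], df[[col2]])
identical.mat      <- outer(colnames(df), colnames(df),
                            FUN = Vectorize(are.cols.identical))

# Keep only entries above the diagonal, so each pair is listed once
# and a column is never paired with itself.
pairs.idx <- which(identical.mat & upper.tri(identical.mat), arr.ind = TRUE)

# Translate the row/column indices back into column names.
pairs <- data.frame(first  = colnames(df)[pairs.idx[, "row"]],
                    second = colnames(df)[pairs.idx[, "col"]],
                    stringsAsFactors = FALSE)
pairs
```

For the example data this lists four pairs: a with d, b with e, b with f, and e with f.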

From here, you can use clustering to identify groups of identical columns (there may be better ways so if you know one, feel free to comment or even edit my answer.)

library(cluster)  # optional: hclust() and cutree() are in stats, attached by default
distances <- as.dist(!identical.mat)
tree      <- hclust(distances)
cut       <- cutree(tree, h = 0.5)
cut
# [1] 1 2 3 1 2 2

split(colnames(df), cut)
# $`1`
# [1] "a" "d"
# 
# $`2`
# [1] "b" "e" "f"
# 
# $`3`
# [1] "c"

Edit 1: to disregard differences in floating point values, one can use

are.cols.identical <- function(col1, col2) isTRUE(all.equal(df[,col1], df[,col2]))
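A small sketch of how the two comparisons differ, using made-up values that are not part of the answer. Note that all.equal returns a character description rather than FALSE when its arguments differ, which is why it is wrapped in isTRUE.

```r
# Two columns that differ only by floating point rounding.
x <- data.frame(u = c(0.1 + 0.2, 1),
                v = c(0.3, 1))

strict  <- identical(x$u, x$v)           # exact comparison
lenient <- isTRUE(all.equal(x$u, x$v))   # comparison within a tolerance
c(strict = strict, lenient = lenient)

# all.equal is also looser about storage type: integer and double
# vectors holding the same values compare as equal.
int.vs.dbl <- isTRUE(all.equal(1:3, c(1, 2, 3)))

# A character column is still not considered equal to a numeric one.
int.vs.chr <- isTRUE(all.equal(1:3, as.character(1:3)))
c(int.vs.dbl = int.vs.dbl, int.vs.chr = int.vs.chr)
```

So with this variant, columns b, e, and f would also match a double column holding 1, 2, 3, which may or may not be what you want.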

Edit 2: a more efficient method than clustering for grouping the names of identical columns is

cut <- apply(identical.mat, 1, function(x)match(TRUE, x))
split(colnames(df), cut)
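To spell out what match(TRUE, x) does here: for each row of the matrix it returns the index of the first column identical to that row's column, so identical columns share the same label. The sketch below repeats the setup so it runs on its own and then keeps only the groups with more than one member; group.id, groups, and dup.groups are illustrative names.

```r
df <- data.frame(a = LETTERS[1:3],
                 b = 1:3,
                 c = as.character(1:3),
                 d = LETTERS[1:3],
                 e = 1:3,
                 f = 1:3,
                 stringsAsFactors = FALSE)

are.cols.identical <- function(col1, col2) identical(df[[col1]], df[[col2]])
identical.mat      <- outer(colnames(df), colnames(df),
                            FUN = Vectorize(are.cols.identical))

# Label each column with the index of the first column it is identical to.
group.id <- apply(identical.mat, 1, function(x) match(TRUE, x))
group.id

# Group the column names by label, then drop columns that have no twin.
groups     <- split(colnames(df), group.id)
dup.groups <- Filter(function(g) length(g) > 1, groups)
dup.groups
```

With the example data, dup.groups contains the groups a, d and b, e, f, while the singleton c is dropped.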
