Find columns with same data in one data.frame
Question
I have a data.frame named A with 5000 columns. How can I find the columns in this data.frame that are equal to each other?
As @John mentioned, there are problems with using duplicated(). I would add that transposing the data.frame coerces all the data to a single data type before it is even compared with duplicated(). For example, here is a data.frame:
df <- data.frame(a = LETTERS[1:3],
                 b = 1:3,
                 c = as.character(1:3),
                 d = LETTERS[1:3],
                 e = 1:3,
                 f = 1:3)
df
#   a b c d e f
# 1 A 1 1 A 1 1
# 2 B 2 2 B 2 2
# 3 C 3 3 C 3 3
Note that column c is very similar to columns b, e, and f, but not identical because of the different types (character versus numeric). The solution suggested by @Jubbles would disregard these differences.
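To see the coercion concretely, here is a minimal sketch on the same example data. It assumes the transposition approach amounts to calling duplicated() on t(df), which is my reading of it rather than something quoted here.

```r
# Same example data as above
df <- data.frame(a = LETTERS[1:3], b = 1:3, c = as.character(1:3),
                 d = LETTERS[1:3], e = 1:3, f = 1:3)

# t() goes through as.matrix(), which must pick one type for every cell;
# with mixed character and integer columns, everything becomes character
transposed <- t(df)
is.character(transposed)

# Each row of the transposed matrix is one original column, so
# duplicated() now sees c ("1", "2", "3") as a repeat of b (1, 2, 3)
dup.after.transpose <- duplicated(transposed)
dup.after.transpose

# identical() on the original columns keeps the type distinction
identical(df$b, df$c)   # integer versus character
identical(df$b, df$e)   # both integer
```

Column c is flagged as a duplicate after transposing, while identical() still tells it apart from b.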
Instead, it seems more appropriate to use the identical() function on the columns of your data.frame. You can compare columns two-by-two using outer():
are.cols.identical <- function(col1, col2) identical(df[,col1], df[,col2])
identical.mat <- outer(colnames(df), colnames(df),
FUN = Vectorize(are.cols.identical))
identical.mat
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] TRUE FALSE FALSE TRUE FALSE FALSE
# [2,] FALSE TRUE FALSE FALSE TRUE TRUE
# [3,] FALSE FALSE TRUE FALSE FALSE FALSE
# [4,] TRUE FALSE FALSE TRUE FALSE FALSE
# [5,] FALSE TRUE FALSE FALSE TRUE TRUE
# [6,] FALSE TRUE FALSE FALSE TRUE TRUE
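With many columns the full matrix is hard to read by eye. As a sketch, the upper triangle can be flattened into a table of identical pairs. The names idx and pair.names are my own and only illustrative.

```r
df <- data.frame(a = LETTERS[1:3], b = 1:3, c = as.character(1:3),
                 d = LETTERS[1:3], e = 1:3, f = 1:3)

are.cols.identical <- function(col1, col2) identical(df[, col1], df[, col2])
identical.mat <- outer(colnames(df), colnames(df),
                       FUN = Vectorize(are.cols.identical))

# Label rows and columns so the matrix can be read without counting indices
dimnames(identical.mat) <- list(colnames(df), colnames(df))

# Keep only the upper triangle: this drops the diagonal (every column is
# identical to itself) and the mirrored lower half
idx <- which(identical.mat & upper.tri(identical.mat), arr.ind = TRUE)

# One row per pair of identical columns
pair.names <- data.frame(first  = colnames(df)[idx[, "row"]],
                         second = colnames(df)[idx[, "col"]])
pair.names
```

For the example data this lists the pairs a/d, b/e, b/f and e/f.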
From here, you can use clustering to identify groups of identical columns (there may be better ways so if you know one, feel free to comment or even edit my answer.)
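# as.dist(), hclust() and cutree() are in the stats package, which is loaded by default, so no library() call is needed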
distances <- as.dist(!identical.mat)
tree <- hclust(distances)
cut <- cutree(tree, h = 0.5)
cut
# [1] 1 2 3 1 2 2
split(colnames(df), cut)
# $`1`
# [1] "a" "d"
#
# $`2`
# [1] "b" "e" "f"
#
# $`3`
# [1] "c"
Edit 1: to disregard differences in floating point values, one can use
are.cols.identical <- function(col1, col2) isTRUE(all.equal(df[,col1], df[,col2]))
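A small illustration of what changes when all.equal() replaces identical(). The values here are made up for the demonstration.

```r
x <- 0.1 + 0.2
y <- 0.3

identical(x, y)            # FALSE: the two doubles differ in their last bits
isTRUE(all.equal(x, y))    # TRUE: equal within all.equal()'s default tolerance

# On a mismatch all.equal() returns a character description rather than
# FALSE, which is why the isTRUE() wrapper is needed
all.equal(1, 1.1)

# Side effect to be aware of: integer and double vectors holding the same
# values now count as equal, while character vectors still do not
isTRUE(all.equal(1:3, c(1, 2, 3)))
isTRUE(all.equal(1:3, c("1", "2", "3")))
```

So with this variant an integer column and a double column holding the same values would land in the same group, which may or may not be what you want.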
Edit 2: a more efficient method than clustering for grouping the names of identical columns is
cut <- apply(identical.mat, 1, function(x) match(TRUE, x))
split(colnames(df), cut)
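Both versions above still build the full matrix first, which for 5000 columns means 25 million calls to identical(). A possible shortcut, sketched below under the assumption that only the groups are needed and not the matrix itself, is to stop at the first earlier column that matches. The function name group.identical.cols is hypothetical, and the worst case (no two columns identical) is still quadratic.

```r
# Group column names by identity without materialising the n x n matrix.
# `data` is any data.frame; nothing is read from the global environment.
group.identical.cols <- function(data) {
  n <- ncol(data)
  # For each column j, find the index of the first column (at or before j)
  # that is identical to it; a column always matches itself, so the search
  # is guaranteed to return by i == j
  first.match <- vapply(seq_len(n), function(j) {
    for (i in seq_len(j)) {
      if (identical(data[[i]], data[[j]])) return(i)
    }
    j  # never reached, kept so the function always returns an integer
  }, integer(1))
  # Columns sharing the same first match form one group
  unname(split(colnames(data), first.match))
}

df <- data.frame(a = LETTERS[1:3], b = 1:3, c = as.character(1:3),
                 d = LETTERS[1:3], e = 1:3, f = 1:3)
groups <- group.identical.cols(df)
groups
```

On the example data this gives the same three groups as before: a and d, then b, e and f, then c on its own.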