Identifying duplicate columns in an R data frame
Problem description
I'm an R newbie and am attempting to remove duplicate columns from a largish data frame (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.
My approach has been to generate a table for each column in the frame into a list, then use the duplicated() function to find entries in the list that are duplicates, as follows:

    age = 18:29
    height = c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
    gender = c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
    testframe = data.frame(age = age, height = height, height2 = height, gender = gender, gender2 = gender)
    tables = apply(testframe, 2, table)
    dups = which(duplicated(tables))
    testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):

    summaries = apply(testframe, 2, summary)
    dups = which(duplicated(summaries))
    testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
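For anyone wanting to see what goes wrong here, the following is a small sketch (not part of the original post) comparing what apply() and lapply() hand to duplicated(). It rebuilds the example testframe from above; the names via_apply, via_lapply, dups_apply, and dups_lapply are made up for illustration. As far as I can tell, apply() first coerces the mixed-type frame to a character matrix, so summary() returns the same three-element Length/Class/Mode vector for every column. Those results are simplified into a matrix, and duplicated() then compares the rows of that matrix rather than the columns of the frame.

```r
# Rebuild the example frame from the question.
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = 18:29, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# apply() works on as.matrix(testframe), which is a character matrix here
# because the frame mixes numeric and non-numeric columns.
via_apply <- apply(testframe, 2, summary)
print(is.matrix(via_apply))  # the per-column summaries were simplified to a matrix
print(via_apply)

# duplicated() on a matrix looks for duplicated ROWS of that matrix,
# so the indices it returns do not refer to columns of testframe.
dups_apply <- which(duplicated(via_apply))
print(dups_apply)

# lapply() visits each column with its own type and always returns a list,
# so duplicated() compares one summary per column.
via_lapply <- lapply(testframe, summary)
dups_lapply <- which(duplicated(via_lapply))
print(dups_lapply)
```

If this reading is right, the single index coming out of the apply() version is a row of the summary matrix that happens to line up with a column position, which would explain why only one column disappears.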
Solution

You can do it with lapply:

    testframe[!duplicated(lapply(testframe, summary))]

summary summarizes the distribution while ignoring the order.

I'm not 100% sure, but I would use digest if the data is huge:

    library(digest)
    testframe[!duplicated(lapply(testframe, digest))]
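One caveat worth illustrating: because summary ignores order, two columns holding the same values in a different order have identical summaries and would be treated as duplicates. The sketch below is an assumption-laden illustration rather than part of the original answer. The toy frame df and the names by_summary, by_value, and deduped are hypothetical, and comparing the columns directly with duplicated(as.list(df)) is an alternative swapped in here using base R only.

```r
# Toy frame: b is a reversed copy of a (not a duplicate column),
# while c is an exact copy of a.
df <- data.frame(a = c(1, 2, 3, 4), b = c(4, 3, 2, 1), c = c(1, 2, 3, 4))

# Comparing summaries ignores the order of the values within a column.
by_summary <- duplicated(lapply(df, summary))

# Comparing the columns themselves is order-sensitive.
by_value <- duplicated(as.list(df))

print(by_summary)
print(by_value)

# Keep only the columns that are not exact copies of an earlier column.
deduped <- df[!by_value]
print(names(deduped))
```

The digest variant should also be order-sensitive, since it hashes each column's contents rather than a summary of them, so it is closer to the by_value comparison than to the summary one.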