Identifying duplicate columns in a dataframe

Question

I'm new to R and am trying to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.

My approach has been to build a table for each column in the frame, collect the tables into a list, and then use duplicated() to find the entries in that list that are duplicates, as follows:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))

This isn't very efficient, especially for large continuous variables. However, I went down this route because I couldn't get the same result using summary (note that the following assumes the original testframe, with its duplicates still in place):

summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))

If you run that code, you'll see it removes only the first duplicate found. I presume I'm doing something wrong. Can anyone point out where I'm going wrong or, even better, point me toward a better way to remove duplicate columns from a dataframe?

Recommended answer

You can use lapply:

testframe[!duplicated(lapply(testframe, summary))]

summary() summarizes each column's distribution while ignoring the order of its values, so two columns that hold the same values in a different order also count as duplicates.
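To make the difference between the apply() attempt in the question and the lapply() version concrete, here is a minimal sketch that rebuilds the question's testframe and runs both. The names kept_summary, kept_exact, reordered, by_summary and by_value are illustrative only, and the as.list() comparison is an alternative added here on the assumption that only exact copies should be dropped; it is not part of the original answer.

```r
# Rebuild the example frame from the question.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# apply() simplifies its per-column results into a matrix, and duplicated()
# on a matrix compares rows (one per summary statistic), not frame columns.
summaries <- apply(testframe, 2, summary)
is.matrix(summaries)

# lapply() keeps one list element per column, so duplicated() compares columns.
kept_summary <- testframe[!duplicated(lapply(testframe, summary))]
names(kept_summary)

# Caveat: summary() ignores order, so a reordered column looks like a duplicate.
reordered <- data.frame(x = c(1, 2, 3), y = c(3, 2, 1))
by_summary <- duplicated(lapply(reordered, summary))

# Assumption: only exact copies should count. Comparing the columns
# themselves flags a column only when every value matches in order.
by_value <- duplicated(as.list(reordered))
kept_exact <- testframe[!duplicated(as.list(testframe))]
names(kept_exact)
```

On the question's data both approaches keep the same columns. They differ only when distinct columns happen to share a summary, as x and y do in the reordered frame.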

I'm not 100% sure, but I would use digest if the data is huge:

library(digest)
testframe[!duplicated(lapply(testframe, digest))]
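As a rough sketch of how the digest approach plays out on the question's example, the code below hashes each column once and then looks for repeated hashes. It assumes the digest package is installed and uses its default hash algorithm; col_hashes and deduped are illustrative names. Unlike summary(), the hash is computed from the column's full contents, so the order of the values matters.

```r
# Assumes the digest package is installed: install.packages("digest")
library(digest)

# Rebuild the example frame from the question.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# One hash string per column; vapply() returns a plain character vector.
col_hashes <- vapply(testframe, digest, character(1))

# Keep only the first column seen for each distinct hash.
deduped <- testframe[!duplicated(col_hashes)]
names(deduped)
```

Comparing short hash strings instead of whole columns is what makes this attractive for a frame with 50K rows and 215 columns.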
