Identifying duplicate columns in an R data frame


Problem description

I'm an R newbie and am attempting to remove duplicate columns from a largish data frame (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.

My approach has been to generate a table for each column in the frame, collect the tables into a list, and then use the duplicated() function to find the elements of the list that are duplicates, as follows:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))

This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):

summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))

If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?

Solution

You can do it with lapply:

testframe[!duplicated(lapply(testframe, summary))]
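One likely reason the apply version in the question behaves differently, sketched here under the assumption that testframe is built exactly as in the question: apply first coerces the data frame to a matrix, and because gender is non-numeric every column becomes character. Each per-column summary is then the same Length/Class/Mode triple, apply stacks those into a 3 x 5 matrix, and duplicated compares the rows of that matrix rather than the columns of the frame. lapply keeps each column's type and returns a list, so duplicated compares one summary per column.

```r
# Rebuild the example frame from the question.
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
gender <- c("M", "F", "M", "M", "F", "F", "M", "M", "F", "M", "F", "M")
testframe <- data.frame(age = age, height = height, height2 = height,
                        gender = gender, gender2 = gender)

# apply() works on as.matrix(testframe), which is a character matrix here,
# so every column summary is just Length / Class / Mode.
summaries_apply <- apply(testframe, 2, summary)
dim(summaries_apply)                              # a matrix, not a list
dups_apply <- which(duplicated(summaries_apply))  # compares matrix ROWS

# lapply() returns a list holding one summary per column.
summaries_lapply <- lapply(testframe, summary)
dups_lapply <- duplicated(summaries_lapply)       # compares list ELEMENTS

# Keep only the columns whose summary has not been seen before.
deduped <- testframe[!dups_lapply]
names(deduped)
```

In this sketch the apply route flags index 3 only because the Mode row repeats the Class row, which happens to coincide with the position of height2. That would account for the "only removes the first duplicate" behaviour described in the question.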

summary summarizes the distribution of each column while ignoring the order of its values, so columns whose summaries match are treated as duplicates.
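Because summary only describes the distribution, it can also flag columns that are not actually identical. A minimal sketch with a made-up frame (the column names a, b, c, x and y are hypothetical): a reordered numeric column has the same summary as the original, and any two character columns of equal length share the same Length/Class/Mode summary. Comparing the columns themselves with duplicated(as.list(df)) is one way to avoid this, at the cost of comparing full columns.

```r
# Made-up data to illustrate false positives from summary-based matching.
df <- data.frame(
  a = c(1, 2, 3),
  b = c(3, 2, 1),        # same values as a, different order
  c = c(1, 2, 3),        # exact copy of a
  x = c("u", "v", "w"),
  y = c("p", "q", "r"),  # unrelated to x, but same length and type
  stringsAsFactors = FALSE
)

# Summary-based matching: b and y are flagged although they differ from a and x.
by_summary <- duplicated(lapply(df, summary))
names(df)[by_summary]

# Value-based matching: only the exact copy is flagged.
by_value <- duplicated(as.list(df))
names(df)[by_value]

# Drop only the columns that are exact copies of an earlier column.
df_exact <- df[!by_value]
```

Whether the order-insensitive behaviour is wanted depends on the data. For removing columns that are literal copies, the value-based comparison is the safer of the two.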

I'm not 100% sure, but I would use digest if the data is huge:

library(digest)
testframe[!duplicated(lapply(testframe, digest))]
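For a frame of the size mentioned in the question, hashing each column once and comparing the short hashes may be cheaper than comparing summaries or full columns. The helper below is only a sketch: drop_duplicate_columns and its hash_fun argument are hypothetical names, and it assumes the digest package is installed whenever the default is used. Hash equality is treated as column equality here, which ignores the very unlikely possibility of a collision.

```r
# Hypothetical helper: keep the first column of every group of identical columns.
# hash_fun defaults to digest::digest, so the digest package must be installed
# unless another hashing function is supplied.
drop_duplicate_columns <- function(df, hash_fun = digest::digest) {
  # One hash string per column; vapply enforces a single character value each.
  hashes <- vapply(df, hash_fun, character(1))
  # Keep only columns whose hash has not appeared in an earlier column.
  df[!duplicated(hashes)]
}
```

With the question's data, drop_duplicate_columns(testframe) would be expected to keep age, height and gender. Unlike summary, a hash of the column depends on the order of its values, so reordered columns are not treated as duplicates.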
