Remove columns of dataframe based on conditions in R


Problem description



I have to remove columns from my dataframe, which has over 4000 columns and 180 rows. The conditions under which I want to remove a column are:

(i) Remove the column if it has fewer than two values/entries.
(ii) Remove the column if it has no two consecutive (one after the other) values.
(iii) Remove the column if all of its values are NA.

I have described the conditions on which a column is to be deleted; the aim here is not just to find a column by its name, as in "How do you delete a column in data.table?". To illustrate:

A       B    C   D  E
0.018  NA    NA  NA NA
0.017  NA    NA  NA NA
0.019  NA    NA  NA NA
0.018  0.034 NA  NA NA
0.018  NA    NA  NA NA
0.015  NA    NA  NA 0.037
0.016  NA    NA  NA 0.031
0.019  NA    0.4 NA 0.025
0.016  0.03  NA  NA 0.035
0.018  NA    NA  NA 0.035
0.017  NA    NA  NA 0.043
0.023  NA    NA  NA 0.040
0.022  NA    NA  NA 0.042

Desired dataframe:

A       E
0.018   NA
0.017   NA
0.019   NA
0.018   NA
0.018   NA
0.015   0.037
0.016   0.031
0.019   0.025
0.016   0.035
0.018   0.035
0.017   0.043
0.023   0.040
0.022   0.042

How can I incorporate these three conditions in one piece of code? I would appreciate your help in this regard.

Reproducible example:

structure(list(Month = c("Jan-2000", "Feb-2000", "Mar-2000", 
"Apr-2000", "May-2000", "Jun-2000"), A.G.L.SJ.INVS...LON..DEAD...13.08.15 = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), ABACUS.GROUP.DEAD...18.02.09 = c(0.00829384766220866, 
0.00332213653674028, 0, 0, NA, NA), ABB.R..IRS. = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Month", 
"A.G.L.SJ.INVS...LON..DEAD...13.08.15", "ABACUS.GROUP.DEAD...18.02.09", 
"ABB.R..IRS."), class = c("data.table", "data.frame"), row.names = c(NA, 
-6L))

Solution

I feel like this is all over-complicated. Condition 2 already implies the other two: if a column has at least two consecutive non-NA values, then it obviously contains more than one value, and it obviously isn't all NAs. So instead of three conditions, this all boils down to a single one. (I prefer not to run many functions per column; rather, run diff once per column and then vectorize the rest.)

cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1

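This works because diff returns NA for every pair of adjacent rows in which at least one value is NA. If a column has no two consecutive values, all nrow(df) - 1 differences are NA, so the column fails the condition.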
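As a small illustration of that reasoning, here is a sketch on two made-up vectors (the names isolated and run are only for this example; the values mimic columns B and E from the question):

```r
# Values present, but never two in a row (like column B in the question)
isolated <- c(NA, 0.034, NA, NA, 0.03, NA)
# A run of consecutive values (like column E in the question)
run <- c(NA, NA, 0.037, 0.031, 0.025, NA)

diff(isolated)  # every difference involves an NA, so all of them are NA
diff(run)       # the differences inside the run are real numbers

# Same test as in cond, applied to a single vector
keep_isolated <- sum(is.na(diff(isolated))) < length(isolated) - 1
keep_run      <- sum(is.na(diff(run))) < length(run) - 1
```

Here keep_isolated is FALSE and keep_run is TRUE, which is the per-column decision that cond makes for the whole data frame.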

Then, just

df[, cond, drop = FALSE]
#        A     E
# 1  0.018    NA
# 2  0.017    NA
# 3  0.019    NA
# 4  0.018    NA
# 5  0.018    NA
# 6  0.015 0.037
# 7  0.016 0.031
# 8  0.019 0.025
# 9  0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042
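
To reproduce this, the illustration table from the question can be typed in by hand. This is only a sketch: the question does not give code for that table, so the construction of df below is an assumption.

```r
# The A..E table from the question, typed in by hand
df <- data.frame(
  A = c(0.018, 0.017, 0.019, 0.018, 0.018, 0.015, 0.016,
        0.019, 0.016, 0.018, 0.017, 0.023, 0.022),
  B = c(NA, NA, NA, 0.034, NA, NA, NA, NA, 0.03, NA, NA, NA, NA),
  C = c(rep(NA, 7), 0.4, rep(NA, 5)),
  D = rep(NA_real_, 13),
  E = c(rep(NA, 5), 0.037, 0.031, 0.025, 0.035, 0.035, 0.043, 0.040, 0.042)
)

# sapply(df, diff) is a (nrow - 1) x ncol matrix of row-to-row differences;
# a column is kept when at least one of its differences is not NA
cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1

result <- df[, cond, drop = FALSE]
names(result)
```

Columns B and C contain values but never two in a row, and D is all NA, so only A and E survive.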


Per your edit, it seems you have a data.table object whose first column (Month) holds dates, so the code needs some modifications.

cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1] 
df[, c(TRUE, cond), with = FALSE]

Some explanations:

  • We want to ignore the first column in our calculations, so we specify .SDcols = -1 when operating on .SD (which stands for Subset of Data in data.table terms).
  • .N is just the row count (similar to nrow(df)).
  • The next step is to subset by the condition. We must not forget to keep the first column too, so we start with c(TRUE, ...).
  • Finally, data.table uses non-standard evaluation by default, so if you want to select columns the way you would in a data.frame, you need to specify with = FALSE.
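
The steps above can be tried on a small table. This is a minimal sketch, assuming the data.table package is installed; the table dt and its column names are made up for illustration, and sapply is used inside j instead of comparing the lapply result directly, which should give the same logical vector.

```r
library(data.table)

# Hypothetical toy table: a label column followed by numeric columns
dt <- data.table(
  Month    = c("Jan-2000", "Feb-2000", "Mar-2000", "Apr-2000"),
  all_na   = c(NA_real_, NA_real_, NA_real_, NA_real_),
  has_run  = c(0.008, 0.003, NA, NA),
  isolated = c(0.5, NA, 0.7, NA)
)

# One logical per numeric column (the first column is excluded via .SDcols):
# TRUE when at least one pair of adjacent rows is fully non-NA
cond <- dt[, sapply(.SD, function(x) sum(is.na(diff(x))) < .N - 1), .SDcols = -1]

# Keep the first column unconditionally, then the columns that passed
kept <- dt[, c(TRUE, cond), with = FALSE]
names(kept)
```

Only Month and has_run remain: all_na has no values at all, and isolated never has two values in a row.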

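A better way, though, would be to remove the columns by reference using := NULL. Note that the condition is inverted here (== instead of <), and the first column is marked FALSE so that it is never dropped: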

cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
df[, which(cond) := NULL]
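
To see what "by reference" means in practice, here is a small sketch, again assuming data.table is installed and using a made-up table. The names dt, before and to_drop are hypothetical; copy() is used only to keep an untouched version for comparison.

```r
library(data.table)

# Hypothetical toy table with one all-NA column and one column with a run
dt <- data.table(
  Month   = c("Jan-2000", "Feb-2000", "Mar-2000", "Apr-2000"),
  all_na  = c(NA_real_, NA_real_, NA_real_, NA_real_),
  has_run = c(0.008, 0.003, NA, NA)
)
before <- copy(dt)  # independent copy, unaffected by := on dt

# TRUE marks a column to drop; the leading FALSE protects the first column
to_drop <- c(FALSE, dt[, sapply(.SD, function(x) sum(is.na(diff(x))) == .N - 1),
                       .SDcols = -1])

# Removes the flagged columns from dt itself; no reassignment is needed
dt[, which(to_drop) := NULL]

names(dt)
names(before)
```

After the := call, dt has lost all_na without being reassigned, while before still has all three columns.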
