Remove columns of dataframe based on conditions in R
Question
I have to remove columns from a data frame that has over 4000 columns and 180 rows. The conditions for removing a column are: (i) remove the column if it has fewer than two values/entries; (ii) remove the column if it has no two consecutive (one after the other) values; (iii) remove the column if all of its values are NA. I have stated the conditions on which a column is to be deleted, so the aim here is not just to find a column by its name as in "How do you delete a column in data.table?". I illustrate as follows:
A B C D E
0.018 NA NA NA NA
0.017 NA NA NA NA
0.019 NA NA NA NA
0.018 0.034 NA NA NA
0.018 NA NA NA NA
0.015 NA NA NA 0.037
0.016 NA NA NA 0.031
0.019 NA 0.4 NA 0.025
0.016 0.03 NA NA 0.035
0.018 NA NA NA 0.035
0.017 NA NA NA 0.043
0.023 NA NA NA 0.040
0.022 NA NA NA 0.042
Desired data frame:
A E
0.018 NA
0.017 NA
0.019 NA
0.018 NA
0.018 NA
0.015 0.037
0.016 0.031
0.019 0.025
0.016 0.035
0.018 0.035
0.017 0.043
0.023 0.040
0.022 0.042
How can I incorporate these three conditions into one piece of code? I would appreciate your help in this regard. Reproducible example:
structure(list(Month = c("Jan-2000", "Feb-2000", "Mar-2000",
"Apr-2000", "May-2000", "Jun-2000"), A.G.L.SJ.INVS...LON..DEAD...13.08.15 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), ABACUS.GROUP.DEAD...18.02.09 = c(0.00829384766220866,
0.00332213653674028, 0, 0, NA, NA), ABB.R..IRS. = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_)), .Names = c("Month",
"A.G.L.SJ.INVS...LON..DEAD...13.08.15", "ABACUS.GROUP.DEAD...18.02.09",
"ABB.R..IRS."), class = c("data.table", "data.frame"), row.names = c(NA,
-6L))
Recommended answer
I feel like this is all over-complicated. Condition 2 already covers the other two conditions: if a column has at least two consecutive non-NA values, then it obviously contains more than one value and is obviously not all NA. So instead of three conditions, this all comes down to a single one. (I prefer not to run many functions per column; rather, run diff once per column and then vectorize the rest.)
cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1
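This works because diff returns NA for any pair of adjacent entries in which at least one is NA. If a column has no two consecutive values, every one of its nrow(df) - 1 differences is NA, so its NA count is not below nrow(df) - 1 and the column is dropped.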
Then, just:
df[, cond, drop = FALSE]
# A E
# 1 0.018 NA
# 2 0.017 NA
# 3 0.019 NA
# 4 0.018 NA
# 5 0.018 NA
# 6 0.015 0.037
# 7 0.016 0.031
# 8 0.019 0.025
# 9 0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042
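For anyone who wants to reproduce this, the following sketch rebuilds the illustrated table from the question as a plain data.frame and applies the same condition. The construction with replace() and rep() is just one convenient way to type the data in; it assumes all columns are numeric.

```r
# Rebuild the question's illustrated table (13 rows, columns A to E)
n <- 13
df <- data.frame(
  A = c(0.018, 0.017, 0.019, 0.018, 0.018, 0.015, 0.016,
        0.019, 0.016, 0.018, 0.017, 0.023, 0.022),
  B = replace(rep(NA_real_, n), c(4, 9), c(0.034, 0.03)),  # two isolated values
  C = replace(rep(NA_real_, n), 8, 0.4),                   # a single value
  D = rep(NA_real_, n),                                    # all NA
  E = c(rep(NA_real_, 5),
        0.037, 0.031, 0.025, 0.035, 0.035, 0.043, 0.040, 0.042)
)

# Keep a column only if fewer than nrow(df) - 1 of its differences are NA
cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1
result <- df[, cond, drop = FALSE]

print(names(result))
```

Note that sapply(df, diff) returns a matrix here only because every column is numeric and has the same length.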
Per your edit, it seems you have a data.table object and a date column as the first column (Month), so the code needs some modifications.
cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1]
df[, c(TRUE, cond), with = FALSE]
Some explanations:
- We want to ignore the first column in our calculations, so we specify .SDcols = -1 when operating on .SD (which stands for Sub Data in data.table).
- .N is just the row count (similar to nrow(df)).
- The next step is to subset by the condition. We must not forget to keep the first column too, so we start with c(TRUE, ...).
- Finally, data.table uses non-standard evaluation by default, so if you want to select columns as you would in a data.frame, you need to specify with = FALSE.
A better way, though, would be to remove the columns by reference using := NULL:
cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
df[, which(cond) := NULL]
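Both data.table variants can be tried on a small table shaped like the question's reproducible example. This is only a sketch: it assumes the data.table package is installed, and the short column names (all_na, has_run, also_na) and rounded values are stand-ins for the long names and values in the question.

```r
library(data.table)  # assumption: data.table is installed

# Small stand-in for the question's data: a Month column plus three numeric columns
df <- data.table(
  Month   = c("Jan-2000", "Feb-2000", "Mar-2000",
              "Apr-2000", "May-2000", "Jun-2000"),
  all_na  = rep(NA_real_, 6),
  has_run = c(0.0083, 0.0033, 0, 0, NA, NA),
  also_na = rep(NA_real_, 6)
)

# Variant 1: build a keep-mask over the non-Month columns, then subset
keep <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1]
kept <- df[, c(TRUE, keep), with = FALSE]

# Variant 2: build a drop-mask and delete by reference, here on a copy
# so that the original table stays available for comparison
df2 <- copy(df)
drop <- c(FALSE, df2[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1,
                     .SDcols = -1])
df2[, which(drop) := NULL]

print(names(kept))
print(names(df2))
```

The copy() call is there only for the demonstration; on a 4000-column table the point of := NULL is precisely to avoid copying.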