Idiom for dropping a single column in a data.table
Question
I need to drop one column from a data.frame containing a few hundred columns.
With a data.frame, I'd use subset to do this conveniently:
> dat <- data.table( data.frame(x=runif(10),y=rep(letters[1:5],2),z=runif(10)),key='y' )
> subset(dat,select=c(-z))
x y
1: 0.1969049 a
2: 0.7916696 a
3: 0.9095970 b
4: 0.3529506 b
5: 0.4923602 c
6: 0.5993034 c
7: 0.1559861 d
8: 0.9929333 d
9: 0.3980169 e
10: 0.1921226 e
Obviously this still works, but it doesn't seem like a very data.table-like idiom. I could manually construct a list of the column names I want to keep, which seems a little more data.table-like:
> dat[,list(x,y)]
x y
1: 0.1969049 a
2: 0.7916696 a
3: 0.9095970 b
4: 0.3529506 b
5: 0.4923602 c
6: 0.5993034 c
7: 0.1559861 d
8: 0.9929333 d
9: 0.3980169 e
10: 0.1921226 e
But then I have to construct such a list, which is clunky.
Is subset the proper way to conveniently drop a column or two, or does it cause a performance hit? If not, what's the better way?
Edit
Benchmarks:
> dat <- data.table( data.frame(x=runif(10^7),y=rep(letters[1:10],10^6),z=runif(10^7)),key='y' )
> microbenchmark( subset(dat,select=c(-z)), dat[,list(x,y)] )
Unit: milliseconds
                         expr       min        lq    median        uq      max
1           dat[, list(x, y)] 102.62826 167.86793 170.72847 199.89789 792.0207
2 subset(dat, select = c(-z))  33.26356  52.55311  53.53934  55.00347 180.8740
But where it may really matter more is memory, if subset copies the whole data.table.
If you want to remove the column permanently, use := NULL:
dat[, z := NULL]
If the columns to drop are held in a character vector, wrap the variable in () to force it to be evaluated as a vector of column names, rather than being taken as the name of a column itself:
toDrop <- c('z')
dat[, (toDrop) := NULL]
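The same pattern extends to several columns at once. The following is a minimal sketch on a toy table; the column names w, x, y and z are made up for illustration and are not from the question's data.

```r
library(data.table)

# Toy table; the column names are illustrative only.
dat <- data.table(w = 1:4, x = runif(4), y = letters[1:4], z = runif(4))

# Names of the columns to drop, held in a character vector.
toDrop <- c("w", "z")

# The parentheses make data.table evaluate toDrop rather than
# treat it as the name of a single column called "toDrop".
dat[, (toDrop) := NULL]

# Only x and y remain.
print(names(dat))
```

Because := works by reference, there is no need to assign the result back to dat.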
If you want to limit the availability of the columns in .SD, you can pass the .SDcols argument:
dat[,lapply(.SD, somefunction) , .SDcols = setdiff(names(dat),'z')]
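As a concrete sketch of the .SDcols call, here somefunction is replaced by a hypothetical helper that counts distinct values; any function that accepts a column vector would do. The toy data is made up for illustration.

```r
library(data.table)

dat <- data.table(x = c(1, 2, 3, 4),
                  y = c("a", "a", "b", "b"),
                  z = c(10, 20, 30, 40))

# Hypothetical stand-in for somefunction: number of distinct values.
somefunction <- function(col) length(unique(col))

# .SD holds every column except z, so the function never sees z.
res <- dat[, lapply(.SD, somefunction), .SDcols = setdiff(names(dat), "z")]

# res has one row with columns x and y; dat itself is unchanged.
print(res)
print(names(dat))
```

Note that this restricts what j sees without removing anything from dat.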
However, data.table inspects the j argument and only gets the columns you use anyway, so it doesn't try to load all the data for .SD (unless .SD appears in your call to j). See FAQ 1.12:
"When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses."
subset.data.table processes the call and eventually evaluates dat[, c('x','y'), with=FALSE].
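To get the same non-destructive result without going through subset, the column names to keep can be computed and passed with with=FALSE. This is a sketch on made-up data; the variable names keep and kept are illustrative.

```r
library(data.table)

dat <- data.table(x = runif(6), y = rep(letters[1:3], 2), z = runif(6))

# Every column name except the one to drop.
keep <- setdiff(names(dat), "z")

# with = FALSE makes j be treated as a character vector of column names.
kept <- dat[, keep, with = FALSE]

print(names(kept))  # the selected columns
print(names(dat))   # the original table still has z
```

This returns a new data.table, so the selected columns are copied, which is where the memory cost mentioned in the question comes from.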
Using := NULL should be basically instantaneous; however, it does permanently delete the column.
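Since the deletion happens by reference, any other variable bound to the same table loses the column too. If the original is still needed, one option is to take an explicit copy() first. The sketch below uses made-up data and the illustrative names alias and backup.

```r
library(data.table)

dat <- data.table(x = 1:3, y = c("a", "b", "c"), z = c(0.1, 0.2, 0.3))

alias  <- dat        # plain assignment: a second name for the same table
backup <- copy(dat)  # explicit deep copy, independent of dat

dat[, z := NULL]     # deletes z in place

print(names(alias))   # z is gone here as well
print(names(backup))  # the copy still has z
```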