Idiom for dropping a single column in a data.table


Problem description


I need to drop one column from a data.frame containing a few hundred columns.

With a data.frame, I'd use subset to do this conveniently:

> dat <- data.table( data.frame(x=runif(10),y=rep(letters[1:5],2),z=runif(10)),key='y' )
> subset(dat,select=c(-z))
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e

Obviously this still works, but it doesn't seem like a very data.table-like idiom. I could manually construct a list of the column names I want to keep, which seems a little more data.table-like:

> dat[,list(x,y)]
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e

But then I have to construct such a list, which is clunky.

Is subset the proper way to conveniently drop a column or two, or does it cause a performance hit? If not, what's the better way?

Edit

Benchmarks:

> dat <- data.table( data.frame(x=runif(10^7),y=rep(letters[1:10],10^6),z=runif(10^7)),key='y' )
> microbenchmark( subset(dat,select=c(-z)), dat[,list(x,y)] )
Unit: milliseconds
                         expr       min        lq    median        uq      max
1           dat[, list(x, y)] 102.62826 167.86793 170.72847 199.89789 792.0207
2 subset(dat, select = c(-z))  33.26356  52.55311  53.53934  55.00347 180.8740

But where it may really matter more is memory, if subset copies the whole data.table.

Solution

If you want to remove the column permanently, use := NULL

dat[, z := NULL]
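
As a minimal sketch of what this looks like on a small table (the toy data below is invented for illustration, and the data.table package is assumed to be installed):

```r
library(data.table)

# Toy table, made up for this example
dat <- data.table(x = 1:4,
                  y = c("a", "a", "b", "b"),
                  z = c(0.1, 0.2, 0.3, 0.4))

# Delete column z by reference: no reassignment is needed,
# dat itself is modified in place
dat[, z := NULL]

names(dat)
```

Note that there is no `dat <-` on the left-hand side; the column is gone from dat after the call.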

If the names of the columns to drop are stored in a character vector, wrap the variable in () so that it is evaluated to its contents rather than being treated as a literal column name.

toDrop <- c('z')

dat[, (toDrop) := NULL]
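
The same form should work for more than one column at a time. A small sketch, with hypothetical column names (id, tmp1, tmp2, value) chosen only for illustration:

```r
library(data.table)

# Toy table with two scratch columns we want to get rid of
dat <- data.table(id = 1:3,
                  tmp1 = 4:6,
                  tmp2 = 7:9,
                  value = c(0.5, 1.5, 2.5))

# Column names held in a character vector
toDrop <- c("tmp1", "tmp2")

# The parentheses make data.table use the contents of toDrop
# as the column names to delete
dat[, (toDrop) := NULL]

names(dat)
```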

If you want to limit the availability of the columns in .SD, you can pass the .SDcols argument

dat[,lapply(.SD, somefunction) , .SDcols = setdiff(names(dat),'z')]
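
To make that concrete, here is a sketch that uses mean in place of the placeholder somefunction, on an invented all-numeric table:

```r
library(data.table)

# Toy numeric table, made up for this example
dat <- data.table(x = c(1, 2, 3, 4),
                  y = c(10, 20, 30, 40),
                  z = c(100, 200, 300, 400))

# Apply mean to every column except z; dat itself is left unchanged
res <- dat[, lapply(.SD, mean), .SDcols = setdiff(names(dat), "z")]

res
names(dat)  # z is still present in the original table
```

Here z is only excluded from .SD for this one call, so nothing is deleted.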

However, data.table inspects the j argument and only retrieves the columns you actually use anyway. See FAQ 1.12:

When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses.

So it doesn't try to load all the data for .SD (unless .SD appears in your call to j).


subset.data.table processes the call and eventually evaluates dat[, c('x','y'), with=FALSE]
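
If you want that non-destructive behaviour without going through subset, one way to write the equivalent selection directly is sketched below. The variable names keep and smaller are just illustrative, and the toy data is invented:

```r
library(data.table)

# Toy table, made up for this example
dat <- data.table(x = 1:3,
                  y = c("a", "b", "c"),
                  z = c(0.1, 0.2, 0.3))

# Names of the columns to keep: everything except z
keep <- setdiff(names(dat), "z")

# with = FALSE makes j be treated as a vector of column names;
# the result is a new data.table and dat keeps all its columns
smaller <- dat[, keep, with = FALSE]

names(smaller)
names(dat)
```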

Using := NULL should be basically instantaneous; however, it does permanently delete the column.
