Can data.table handle identical column names when using .SDcols?

Question

When using .SD to apply a function to a subset of dt's columns, I can't seem to find the correct way to handle duplicated column names. For example:

library(data.table)

#  Make some data
set.seed(123)
dt <- data.table( matrix( sample(6, 16, replace = TRUE) , 4 ) )
setnames(dt , rep( letters[1:2] , 2 ) )
#   a b a b
#1: 2 6 4 5
#2: 5 1 3 4
#3: 3 4 6 1
#4: 6 6 3 6

#  Use .SDcols to multiply both 'a' columns by 2, specifying them by numeric position
dt[ , lapply( .SD , `*`  , 2 ) , .SDcols = which( names(dt) %in% "a" ) ]
#    a  a
#1:  4  4
#2: 10 10
#3:  6  6
#4: 12 12

I couldn't get it to work when .SDcols was a character vector of column names, so I tried numeric positions (which( names(dt) %in% "a" ) gives the vector [1] 1 3), but that also seems to multiply only the first a column. Am I doing something wrong?


.SDcols Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions.

These also returned the same result as above...

dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = which( names(dt) %in% "a" ) ]
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]

packageVersion("data.table")
#[1] ‘1.8.11’


Recommended answer

This now works as intended in the current development version 1.9.3. From NEWS:


Consistent subset rules on data.tables with duplicate columns. In short, if indices are directly provided, 'j', or in .SDcols, then just those columns are either returned (or deleted if you provide -.SDcols or !j). If instead, column names are given and there are more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #5688 and #5008. Note that using by= to aggregate on duplicate columns may not give intended result still, as it may not operate on the proper column.
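Since the rule depends on whether a name occurs more than once, it can help to check for repeated names before choosing between names and positions. A minimal sketch using only base R helpers on a toy table (the data is made up for illustration and is not from the question):

```r
library(data.table)

# Toy table with the same duplicated names as in the question
dt <- data.table(matrix(1:16, 4))
setnames(dt, rep(letters[1:2], 2))

# Names that occur more than once
dup_names <- unique(names(dt)[duplicated(names(dt))])

# All positions of "a", suitable for passing to .SDcols as indices
a_idx <- which(names(dt) == "a")
```

Here dup_names holds "a" and "b", and a_idx holds the positions 1 and 3.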

Basically, if you select by name, the first matching column is returned each time:

dt[, lapply(.SD, `*`, 2), .SDcols=c("a", "a")]
#     a  a
# 1:  4  4
# 2: 10 10
# 3:  6  6
# 4: 12 12

But if you specify the column positions explicitly (as you do in your question), each of those columns is used:

dt[, lapply(.SD, `*`, 2), .SDcols=which( names(dt) %in% "a" )]
#     a  a
# 1:  4  8
# 2: 10  6
# 3:  6 12
# 4: 12  6
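
If selecting by position feels fragile, one alternative (not part of the original answer, so treat it as a sketch) is to make the names unique first with base R's make.unique() and then pass names to .SDcols. The table below is the one from the question typed in by hand, and the regular expression is an assumption about how the deduplicated names look.

```r
library(data.table)

# Table from the question, entered by hand so it does not depend on the RNG
dt <- data.table(c(2, 5, 3, 6), c(6, 1, 4, 6), c(4, 3, 6, 3), c(5, 4, 1, 6))
setnames(dt, rep(letters[1:2], 2))

# make.unique() appends .1, .2, ... to repeated names: a, b, a.1, b.1
setnames(dt, make.unique(names(dt)))

# Assumption: every column derived from "a" is now named "a" or "a.<n>"
a_cols <- grep("^a(\\.[0-9]+)?$", names(dt), value = TRUE)

# With unique names, a character .SDcols is unambiguous
res <- dt[, lapply(.SD, `*`, 2), .SDcols = a_cols]
```

The trade-off is that the result carries the renamed columns (a and a.1) rather than two columns both called a.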
