Why is data.table casting automatically when I assign all columns by reference

Question

Here is something I do not understand about data.table. If I select one row and try to set all of its values to NA, the columns of the resulting one-row data.table are cast to logical.

# Here is a sample table
library(data.table)
DT <- data.table(a=rep(1L,3),b=rep(1.1,3),d=rep('aa',3))
DT
   a   b  d
1: 1 1.1 aa
2: 1 1.1 aa
3: 1 1.1 aa

#Here I extract a line, all the column types are kept... good
str(DT[1])
Classes ‘data.table’ and 'data.frame':  1 obs. of  3 variables:
 $ a: int 1
 $ b: num 1.1
 $ d: chr "aa"
 - attr(*, ".internal.selfref")=<externalptr> 

#Now here I want to set them all to NA...they all become logicals => WHY IS THAT ?
str(DT[1][,colnames(DT):=NA])
Classes ‘data.table’ and 'data.frame':  1 obs. of  3 variables:
 $ a: logi NA
 $ b: logi NA
 $ d: logi NA
 - attr(*, ".internal.selfref")=<externalptr> 

Edit: I think it is a bug, because a one-row table is cast to logical while a two-row table keeps its type:

R) str(DT[1][,a:=NA])
Classes ‘data.table’ and 'data.frame':  1 obs. of  3 variables:
 $ a: logi NA
 $ b: num 1.1
 $ d: chr "aa"
 - attr(*, ".internal.selfref")=<externalptr> 
R) str(DT[1:2][,a:=NA])
Classes ‘data.table’ and 'data.frame':  2 obs. of  3 variables:
 $ a: int  NA NA
 $ b: num  1.1 1.1
 $ d: chr  "aa" "aa"
 - attr(*, ".internal.selfref")=<externalptr> 


Recommended answer

To provide an answer, from ?":=" :


Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.

The motivation for all this is large tables (say 10GB in RAM), of course. Not 1 or 2 row tables.

To put it more simply: if length(RHS) == nrow(DT) then the RHS (and whatever its type) is plonked into that column slot. Even if those lengths are 1. If length(RHS) < nrow(DT) , the memory for the column (and its type) is kept in place, but the RHS is coerced and recycled to replace the (subset of) items in that column.
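The rule above can be sketched as follows. This is a minimal illustration, assuming the data.table package is installed; the table and the object names `full`, `recycled` and `one_row` are made up for the example, and the exact behaviour may differ between data.table versions.

```r
library(data.table)

DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3))

# length(RHS) == nrow(DT): the RHS is plonked into the column slot,
# so the column takes the type of the RHS (logical here).
full <- copy(DT)[, a := rep(NA, 3)]
class(full$a)

# length(RHS) < nrow(DT): the existing integer column is kept in place
# and the NA is coerced to its type and recycled.
recycled <- copy(DT)[, a := NA]
class(recycled$a)

# One-row table: a length-1 RHS already has length nrow(DT), so it is a plonk.
one_row <- DT[1][, a := NA]
class(one_row$a)
```

`copy()` is used so that each assignment by reference works on its own table and leaves `DT` untouched.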

If I need to change a column's type in a large table I write:

DT[, col := as.numeric(col)]

Here as.numeric allocates a new vector and coerces col into that new memory, which is then plonked into the column slot. It's as efficient as it can be. The reason that's a plonk is that length(RHS) == nrow(DT).

If you want to overwrite a column with a different type containing some default value:

DT[, col := rep(21.5,nrow(DT))]    # i.e., deliberately harder

If col was type integer before, then it'll change to type numeric containing 21.5 for every row. Otherwise just DT[, col := 21.5] would result in a warning about 21.5 being coerced to 21 (unless DT is only 1 row!)
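Coming back to the one-row case in the question, one possible way to blank a row while keeping the column types is to build each replacement column from the existing one, so that every NA already has the right type when it is plonked. This is only a sketch and not part of the quoted documentation; the helper name `na_like` is hypothetical.

```r
library(data.table)

DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3), d = rep("aa", 3))

# Hypothetical helper: returns a vector of the same length and type as x
# with every element set to NA.
na_like <- function(x) {
  x[] <- NA
  x
}

one_row <- DT[1]
cols <- names(one_row)

# Each replacement column has length nrow(one_row), so it is plonked,
# but it carries the original type, so no column becomes logical.
one_row[, (cols) := lapply(.SD, na_like), .SDcols = cols]
str(one_row)
```

The design choice here is to let base R subassignment (`x[] <- NA`) pick the typed NA, rather than listing `NA_integer_`, `NA_real_` and `NA_character_` by hand for each column.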
