Why is data.table casting column classes when I assign all columns by reference

Question

Here is something I do not understand about data.table: if I select a single row and try to set all of its values to NA, every column of the resulting one-row data.table is coerced to logical.

#Here is a sample table
DT <- data.table(a=rep(1L,3),b=rep(1.1,3),d=rep('aa',3))
DT
#    a   b  d
# 1: 1 1.1 aa
# 2: 1 1.1 aa
# 3: 1 1.1 aa

#Here I extract a line, all the column types are kept... good
str(DT[1])
# Classes ‘data.table’ and 'data.frame':  1 obs. of  3 variables:
#  $ a: int 1
#  $ b: num 1.1
#  $ d: chr "aa"
#  - attr(*, ".internal.selfref")=<externalptr> 

#Now here I want to set them all to `NA`...they all become logicals => WHY IS THAT ?
str(DT[1][,colnames(DT) := NA])
# Classes ‘data.table’ and 'data.frame':  1 obs. of  3 variables:
#  $ a: logi NA
#  $ b: logi NA
#  $ d: logi NA
#  - attr(*, ".internal.selfref")=<externalptr> 

Edit: I think it is a bug, because assigning NA to a single column gives a different result depending on the number of rows:

str(DT[1][ , a := NA])
# Classes ‘data.table’ and 'data.frame':  1 obs. of  3 variables:
#  $ a: logi NA
#  $ b: num 1.1
#  $ d: chr "aa"
#  - attr(*, ".internal.selfref")=<externalptr> 

str(DT[1:2][ , a := NA])
# Classes ‘data.table’ and 'data.frame':  2 obs. of  3 variables:
#  $ a: int  NA NA
#  $ b: num  1.1 1.1
#  $ d: chr  "aa" "aa"
#  - attr(*, ".internal.selfref")=<externalptr> 


Recommended answer

To provide an answer, from ?":=" :

Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it's clearer to readers of your code that you really do intend to change the column type.

The motivation for all this is large tables (say 10GB in RAM), of course. Not 1 or 2 row tables.

To put it more simply: if length(RHS) == nrow(DT) then the RHS (and whatever its type) is plonked into that column slot. Even if those lengths are 1. If length(RHS) < nrow(DT), the memory for the column (and its type) is kept in place, but the RHS is coerced and recycled to replace the (subset of) items in that column.
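The rule above can be checked against the question's own sample table. This is only a minimal sketch: the object names `one` and `two` are illustrative and not from the original post, and the classes noted in the comments are the ones reported in the question's output.

```r
library(data.table)

DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3), d = rep("aa", 3))

# length(RHS) == nrow (1 == 1): the logical NA is plonked into the
# column slot, so the column takes the type of the RHS
one <- DT[1][, a := NA]
class(one$a)   # "logical"

# length(RHS) < nrow (1 < 2): the integer column is kept in place and
# the NA is coerced to integer and recycled
two <- DT[1:2][, a := NA]
class(two$a)   # "integer"
```

So the behavior in the question is not a bug: with a one-row table a length-1 RHS always counts as a whole column.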

If I need to change a column's type in a large table I write:

DT[, col := as.numeric(col)]

Here as.numeric allocates a new vector and coerces "col" into that new memory, which is then plonked into the column slot. It's as efficient as it can be. It is a plonk because length(RHS) == nrow(DT).

If you want to overwrite a column with a different type containing some default value:

DT[, col := rep(21.5, nrow(DT))]    # i.e., deliberately harder

If "col" was type integer before, then it'll change to type numeric containing 21.5 for every row. Otherwise just DT[, col := 21.5] would result in a warning about 21.5 being coerced to 21 (unless DT is only 1 row!)
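Coming back to the original question, one possible way to blank out a one-row table while keeping its column types is to supply an NA of the matching type for each column, so that what gets plonked in already has the right class. The following is a sketch and not part of the original answer: the names `one_row` and `cols` are hypothetical, and it relies on base R's `x[NA_integer_]`, which returns a length-1 NA of the same type as `x`.

```r
library(data.table)

DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3), d = rep("aa", 3))

# DT[1] is a new one-row table, so DT itself is not modified below
one_row <- DT[1]
cols <- names(one_row)

# Build a typed NA per column (NA_integer_, NA_real_, NA_character_)
# and assign them all at once
one_row[, (cols) := lapply(.SD, function(x) x[NA_integer_])]

str(one_row)
```

For a single known column, the typed constants can be written directly, for example `a := NA_integer_`.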
