Secondary key ("index" attribute) in data.table is lost when table is "copied" by selecting columns
Problem description
I have a data.table myDT, and I'm making "copies" of this table in three different ways:

myDT <- data.table(colA = 1:3)
myDT[colA == 3]
copy1 <- copy(myDT)
copy2 <- myDT             # yes, I know that it's a reference, not a real copy
copy3 <- myDT[, .(colA)]  # I list all columns from the original table
Then I'm comparing those copies with the original table:
identical(myDT, copy1)
# TRUE
identical(myDT, copy2)
# TRUE
identical(myDT, copy3)
# FALSE
I was trying to figure out what the difference between myDT and copy3 was:

identical(names(myDT), names(copy3))
# TRUE
all.equal(myDT, copy3, check.attributes=FALSE)
# TRUE
all.equal(myDT, copy3, check.attributes=FALSE, trim.levels=FALSE, check.names=TRUE)
# TRUE
attr.all.equal(myDT, copy3, check.attributes=FALSE, trim.levels=FALSE, check.names=TRUE)
# NULL
all.equal(myDT, copy3)
# [1] "Attributes: < Length mismatch: comparison on first 1 components >"
attr.all.equal(myDT, copy3)
# [1] "Attributes: < Names: 1 string mismatch >"
# [2] "Attributes: < Length mismatch: comparison on first 3 components >"
# [3] "Attributes: < Component 3: Attributes: < Modes: list, NULL > >"
# [4] "Attributes: < Component 3: Attributes: < names for target but not for current > >"
# [5] "Attributes: < Component 3: Attributes: < current is not list-like > >"
# [6] "Attributes: < Component 3: Numeric: lengths (0, 3) differ >"
My original question was how to interpret that last output. In the end I resorted to the attributes() function:

attr0 <- attributes(myDT)
attr3 <- attributes(copy3)
str(attr0)
str(attr3)
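Reading two str() dumps side by side works, but the comparison can also be narrowed to attribute names. The following is a minimal sketch in base R; the helper attr_only_in() and the stand-in data frames a and b are made up for illustration and are not part of the original post.

```r
# Hypothetical helper: names of attributes present on x but absent from y
attr_only_in <- function(x, y) {
  setdiff(names(attributes(x)), names(attributes(y)))
}

# Stand-in objects (plain data frames, not the tables from the question)
a <- data.frame(colA = 1:3)
b <- a
attr(a, "index") <- integer(0)  # mimic one extra attribute on the original

only_in_a <- attr_only_in(a, b)  # attributes that a has and b lacks
only_in_b <- attr_only_in(b, a)  # attributes that b has and a lacks
print(only_in_a)
print(only_in_b)
```

Applied to the tables from the question, the same call would be attr_only_in(myDT, copy3).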
It showed that the original data.table had an index attribute which was not copied when I created copy3.

Solution

To make this question a bit clearer (and maybe useful for future readers): what really happened here is that either you set a secondary key explicitly by calling set2key (probably not), or data.table set one for you while you were performing an ordinary operation such as filtering. This is a (not so) new feature added in v1.9.4:

"DT[column==value] and DT[column %in% values] are now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==value] is much faster. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using set2key() and existence checked using key2(). These optimizations and function names/arguments are experimental and may be turned off with options(datatable.auto.index=FALSE)."
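The quoted note calls set2key() and key2() experimental. As far as I know, later data.table releases replaced them with setindex() and indices(), so the sketch below uses those names; treat that as an assumption about your installed version. The table DT and its columns are illustrative only.

```r
library(data.table)

# Illustrative table; the name DT is not from the original post
DT <- data.table(A = 1:3, B = letters[1:3])

# Switch off automatic index creation, remembering the previous setting
old_opts <- options(datatable.auto.index = FALSE)

DT[A == 3L]               # filtering should not add an index now
auto_idx <- indices(DT)   # NULL when the table carries no index

setindex(DT, A)           # add a secondary key (index) explicitly
manual_idx <- indices(DT) # names of the indexed columns

options(old_opts)         # restore the previous setting
```

Turning the option off only stops the automatic indexing; an index added explicitly with setindex() is still recorded in the "index" attribute.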
Let's reproduce this:

myDT <- data.table(A = 1:3)
options(datatable.verbose = TRUE)
myDT[A == 3]
# Creating new index 'A' <~~~~ Here it is
# forder took 0 sec
# Coercing double column i.'V1' to integer to match type of x.'A'. Please avoid coercion for efficiency.
# Starting bmerge ...done in 0 secs
#    A
# 1: 3
attr(myDT, "index") # or using `key2(myDT)`
# integer(0)
# attr(,"__A")
# integer(0)
So, contrary to what you assumed, you actually did create a copy, and thus the secondary key wasn't transferred with it. Compare:

copy1 <- myDT
attr(copy1, "index")
# integer(0)
# attr(,"__A")
# integer(0)
copy2 <- myDT[,.(A)]
# Detected that j uses these columns: A <~~~ This is where the copy occurs
attr(copy2, "index")
# NULL
identical(myDT, copy1)
# [1] TRUE
identical(myDT, copy2)
# [1] FALSE
And for some further validation:

tracemem(myDT)
# [1] "<00000000159CBBB0>"
tracemem(copy1)
# [1] "<00000000159CBBB0>"
tracemem(copy2)
# [1] "<000000001A5A46D8>"
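tracemem() depends on R having been built with memory profiling. A similar check can be sketched with data.table's own address() function, which returns an object's memory address as a string. The variable names alias, selected and deep are illustrative and not from the original answer.

```r
library(data.table)

myDT <- data.table(A = 1:3)

alias    <- myDT           # plain assignment: another name for the same object
selected <- myDT[, .(A)]   # column selection through [.data.table
deep     <- copy(myDT)     # explicit deep copy

# Compare each object's address with the original's
same_alias    <- address(alias)    == address(myDT)
same_selected <- address(selected) == address(myDT)
same_deep     <- address(deep)     == address(myDT)

print(c(alias = same_alias, selected = same_selected, deep = same_deep))
```

Only the plain assignment is expected to share an address with myDT; the other two are separate objects.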
The most interesting conclusion here, one could argue, is that [.data.table does create a copy, even if the object remains unchanged.
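If the index matters on the column-selected table, one option is to check for it and re-create it on the result. This is only a sketch: it assumes setindex() and indices() are available in your data.table version, and whether a given release carries the index over on its own may differ, which is why the code checks first. The names myDT and sub are illustrative.

```r
library(data.table)

myDT <- data.table(A = 1:3, B = 4:6)
setindex(myDT, A)          # original table carries an index on A

sub <- myDT[, .(A)]        # new object created by [.data.table

# Re-create the index only if the new table does not already have it
if (!("A" %in% indices(sub))) {
  setindex(sub, A)
}

print(indices(sub))
```

When a full copy is wanted rather than a column subset, copy(myDT) is the simpler route; in the question, the copy() result compared identical to the original.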