What can you do with a data.frame that you can't with a data.table?
I just started using R, and came across data.table. I found it brilliant.
A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even changing something as simple as DF[,1] would break existing code in many packages and in user code. This is by design, and we want it to work this way so that more complicated syntax can work. There are other differences, too (see FAQ 2.17).

Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame, and that package can use [.data.frame syntax on the data.table.

We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:

unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.

A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
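The inheritance point in FAQ 1.8 can be checked directly. The following is a minimal sketch, assuming the data.table package is installed; column_means() is a hypothetical helper written for this illustration and stands in for code that only knows about data.frame.

```r
library(data.table)

DF <- data.frame(id = 1:3, value = c(10, 20, 30))
DT <- as.data.table(DF)

# A data.table carries both classes, so checks for data.frame succeed.
dt_classes <- class(DT)
dt_is_df   <- is.data.frame(DT)

# Hypothetical helper: accepts anything that passes is.data.frame()
# and treats it as a plain list of columns.
column_means <- function(df) {
  stopifnot(is.data.frame(df))
  vapply(df, mean, numeric(1))
}

means_df <- column_means(DF)
means_dt <- column_means(DT)
```

Both calls go through the same code path, since the helper never looks at anything data.table-specific.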
What are the smaller syntax differences between data.frame and data.table?

- DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column.
- DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent).
  - For this reason we say the comma is optional in DT, but not optional in DF.
- DT[[3]] == DF[, 3] == DF[[3]]
- DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset, which returns a vector.
- DT[ , j], where j is a single integer, returns a one-column data.table, unlike DF[, j], which returns a vector by default.
- DT[ , "colA"][[1]] == DF[ , "colA"].
- DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes).
- DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
- DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
- DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA.
- DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
- data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
- check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
- stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
- Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table, with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.

In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single-column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.

When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
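Several of the differences listed above can be tried on a small table. This is a sketch rather than an exhaustive check: the data and the column names colA, colB and colC are made up for illustration, and it assumes data.table v1.9.8 or later. For the bare NA case it uses the row form DF[NA, ] on the data.frame side.

```r
library(data.table)

# Toy data; the column names are illustrative only.
DF <- data.frame(colA = c(1, NA, 3), colB = c(1, 2, 4), colC = c("x", "y", "z"))
DT <- as.data.table(DF)

# A single index selects a row in data.table but a column in data.frame.
dt_row3 <- DT[3]   # one-row data.table
df_col3 <- DF[3]   # one-column data.frame

# Double brackets extract the third column as a vector in both.
same_col <- identical(DT[[3]], DF[[3]])

# An unquoted column name in j returns a vector, like DF[, "colA"].
same_vec <- identical(DT[, colA], DF[, "colA"])

# list() in j keeps a one-column table, like drop = FALSE.
dt_one_col <- DT[, list(colA)]
df_one_col <- DF[, "colA", drop = FALSE]

# NA in a logical row filter: data.table treats it as FALSE,
# data.frame returns an extra all-NA row for it.
dt_filtered <- DT[colA == colB]
df_filtered <- DF[DF$colA == DF$colB, ]
df_cleaned  <- DF[!is.na(DF$colA) & !is.na(DF$colB) & DF$colA == DF$colB, ]

# A bare NA: data.table returns a single NA row, while the data.frame
# row subset recycles the logical NA across every row.
dt_na <- DT[NA]
df_na <- DF[NA, ]
```

Printing dt_filtered next to df_filtered shows why the data.frame version needs the explicit is.na() guards to get the same single row.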
Small caveat
There may be cases where a package uses code that fails when given a data.table rather than a data.frame. However, since data.table is actively maintained to avoid such problems, any that arise should be fixed promptly.
For example
From the NEWS for v 1.8.2
- base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
- An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.