What you can do with a data.frame that you can't with a data.table?
Question
I just started using R, and came across data.table. I found it brilliant.
A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?
Recommended answer
As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).
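As a rough illustration of the difference in j, the sketch below contrasts column selection in [.data.frame with expression evaluation in [.data.table. The column names g and v and the sample values are made up for the example; it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical sample data: a grouping column g and a value column v.
DF <- data.frame(g = c("a", "a", "b"), v = 1:3)
DT <- as.data.table(DF)

# In [.data.frame, j picks columns by position or name.
first_col_df <- DF[, 1]

# In [.data.table, j is an expression evaluated with the columns
# available as variables.
total <- DT[, sum(v)]

# The same expression can be evaluated per group with by.
by_group <- DT[, list(total = sum(v)), by = g]
```

Because j is evaluated as an expression, aggregations and grouping fit inside the brackets, which is the "more complicated syntax" the answer refers to.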
Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame, and that package can use [.data.frame syntax on the data.table.
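The inheritance can be checked directly. This is a minimal sketch, assuming the data.table package is installed; count_rows is a hypothetical stand-in for a function in some package that only accepts a data.frame.

```r
library(data.table)

DT <- data.table(x = 1:3, y = c("a", "b", "c"))

# The class vector lists data.table first and data.frame second.
dt_class <- class(DT)

is_df <- is.data.frame(DT)
inherits_df <- inherits(DT, "data.frame")

# Hypothetical function that insists on a data.frame argument.
count_rows <- function(df) {
  stopifnot(is.data.frame(df))
  nrow(df)
}
n <- count_rows(DT)

# An explicit conversion is available when a plain data.frame is wanted.
plain_df <- as.data.frame(DT)
```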
We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html
What are the smaller syntax differences between data.frame and data.table?
- DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column.
- DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent).
- For this reason we say the comma is optional in DT, but not optional in DF.
- DT[[3]] == DF[, 3] == DF[[3]]
- DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset, which returns a vector.
- DT[ , j], where j is a single integer, returns a one-column data.table, unlike DF[, j], which returns a vector by default.
- DT[ , "colA"][[1]] == DF[ , "colA"].
- DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes).
- DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
- DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
- DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA.
- DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
- data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
- check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
- stringsAsFactors is by default TRUE in data.frame (in R versions before 4.0.0) but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
- Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.

In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single-column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.

When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
Small caveat
There may be cases where a package uses code that fails when given a data.table. However, since data.table is constantly being maintained to avoid such problems, any that do arise will be fixed promptly.
For example
From the NEWS for v 1.8.2
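- base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
- An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.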