What you can do with data.frame that you can't in data.table
Question
I just started using R, and came across data.table. I found it brilliant.
A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?
Recommended answer
As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).
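To make the difference in j concrete, here is a minimal sketch. The table and the column names (amount, units) are invented for illustration, and the failure on the data.frame side assumes that no variable called amount exists in the calling environment.

```r
library(data.table)

# Hypothetical example data; the column names are made up for this sketch.
DF <- data.frame(units = 1:5, amount = c(2, 4, 6, 8, 10))
DT <- as.data.table(DF)

# In [.data.table, j is an expression evaluated with the columns in scope.
total_dt <- DT[, sum(amount)]

# In [.data.frame, j only selects columns, so the column is extracted
# first and summed afterwards.
total_df <- sum(DF[, "amount"])

# The same expression in the j slot of a data.frame fails, because
# `amount` is looked up in the calling environment, not inside DF
# (assuming no such variable is defined there).
df_attempt <- tryCatch(DF[, sum(amount)], error = function(e) "error")
```

Both totals agree, but only the data.table call lets the computation live inside the brackets.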
Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame, and that package can use [.data.frame syntax on the data.table.
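The inheritance can be checked directly. The sketch below uses made-up data and passes a data.table to stats::lm(), chosen here only as an example of code that was written for data.frame.

```r
library(data.table)

# Hypothetical data: y is an exact linear function of x.
DT <- data.table(x = 1:10, y = 2 * (1:10) + 1)

# A data.table carries both classes, so data.frame checks succeed.
cls   <- class(DT)
is_df <- is.data.frame(DT)

# lm() is not data.table-aware, yet it accepts DT as its data argument.
fit <- lm(y ~ x, data = DT)
cf  <- unname(coef(fit))
```

Here cls holds the two class names in order, data.table first and data.frame second, which is what lets data.frame methods apply.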
We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html
2.17 What are the smaller syntax differences between data.frame and data.table?
- DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column.
- DT[3,] == DT[3], but DF[,3] == DF[3] (somewhat confusingly). For this reason we say the comma is optional in DT, but not optional in DF.
- DT[[3]] == DF[3] == DF[[3]]
- DT[i,] where i is a single integer returns a single row, just like DF[i,], but unlike a matrix single row subset, which returns a vector.
- DT[,j,with=FALSE] where j is a single integer returns a one-column data.table, unlike DF[,j], which returns a vector by default.
- DT[,"colA",with=FALSE][[1]] == DF[,"colA"]
- DT[,colA] == DF[,"colA"]
- DT[,list(colA)] == DF[,"colA",drop=FALSE]
- DT[NA] returns 1 row of NA, but DF[NA] returns a copy of DF containing NA throughout. The symbol NA is type logical in R, and is therefore recycled by [.data.frame. The intention was probably DF[NA_integer_]. [.data.table does this automatically for convenience.
- DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE)] returns NA rows for each NA.
- DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]
- data.frame(list(1:2,"k",1:4)) creates 3 columns, data.table creates one list column.
- check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
- stringsAsFactors is by default TRUE in data.frame (in R versions before 4.0.0) but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
- Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
- In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single-column data.frame. In [.data.table we took the opportunity to make it consistent and drop drop.
- When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
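A few of the differences listed above can be tried out directly. This is only a sketch on invented data (a three-row table with columns colA and colB), and it sticks to the items whose behaviour should not depend on the installed data.table version.

```r
library(data.table)

# Hypothetical data with one NA so the row-filter difference is visible.
DF <- data.frame(colA = c(1, NA, 3), colB = c(1, 2, 4))
DT <- as.data.table(DF)

# A single index without a comma: a row in DT, a column in DF.
dim_dt <- dim(DT[2])   # 1 row, 2 columns
dim_df <- dim(DF[2])   # 3 rows, 1 column

# Selecting one column as a vector: bare name in DT, quoted name in DF.
v_dt <- DT[, colA]
v_df <- DF[, "colA"]

# Keeping the result as a one-column table.
one_dt <- DT[, list(colA)]
one_df <- DF[, "colA", drop = FALSE]

# An NA in a logical row filter is treated as FALSE by data.table,
# but produces an all-NA row in data.frame.
eq_dt <- DT[colA == colB]
eq_df <- DF[DF$colA == DF$colB, ]

# An unnamed list becomes three columns in data.frame
# but a single list column in data.table.
nc_df <- ncol(data.frame(list(1:2, "k", 1:4)))
nc_dt <- ncol(data.table(list(1:2, "k", 1:4)))
```

The row-filter case is the one most likely to matter in practice: eq_df carries an extra NA row that eq_dt does not.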
Small caveat
There may be cases where a package uses code that fails when given a data.table. However, data.table is actively maintained to avoid such problems, so any that do arise are likely to be fixed promptly.
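For example:
- base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
- An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error (#1713). Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.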
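One way to do the ITime to POSIXct conversion mentioned above is sketched below. It relies only on the fact that ITime stores integer seconds since midnight; the dummy date 1970-01-01 and the UTC time zone are assumptions made for the example, and the plotting step itself is left out.

```r
library(data.table)

# Hypothetical times of day.
it <- as.ITime(c("00:00:30", "12:30:00", "23:59:59"))

# ITime is stored as integer seconds since midnight.
secs <- as.integer(it)

# Attach the seconds to an arbitrary dummy date (an assumption of this
# sketch) so that plotting code sees an ordinary POSIXct vector.
px <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Formatting shows the time of day survived the round trip.
labels <- format(px, "%H:%M:%S", tz = "UTC")
```

The resulting px vector can then be mapped to an axis like any other date-time column, with the date part ignored in the labels.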