What can you do with data.frame that you can't do with data.table?


Question

I just started using R, and came across data.table. I found it brilliant.

A very naive question: can I ignore data.frame and use only data.table, to avoid confusing the syntax of the two?

Recommended answer

From the data.table FAQ:


As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).
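
To make the difference in j concrete, here is a minimal sketch. The toy data and the column names g and v are made up for illustration, and it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy data; g and v are illustrative column names.
DF <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 3))
DT <- as.data.table(DF)

# In [.data.frame, j picks columns by name or position.
v_from_df <- DF[, "v"]

# In [.data.table, j is an expression evaluated with the columns as variables.
total <- DT[, sum(v)]
by_group <- DT[, list(total = sum(v)), by = g]
```

Because j is evaluated as an expression, aggregation and grouping fit inside the brackets, which is the "more complicated syntax" the FAQ refers to.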

Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table.
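
A short sketch of what inheritance means in practice, assuming data.table is installed. The data is hypothetical, and stats::lm stands in for "a function that only knows about data.frame".

```r
library(data.table)

# Hypothetical toy data lying exactly on the line y = 2x + 1.
DT <- data.table(x = 1:4, y = c(3, 5, 7, 9))

# A data.table carries both classes, so data.frame checks succeed.
is_df <- is.data.frame(DT)
inherits_df <- inherits(DT, "data.frame")

# lm() was written for data.frame input and accepts the data.table as is.
fit <- lm(y ~ x, data = DT)
```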

We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:


unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.

A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html



2.17 What are the smaller syntax differences between data.frame and data.table?



  • DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
  • DT[3,] == DT[3], but DF[,3] == DF[3] (somewhat confusingly)
  • For this reason we say the comma is optional in DT, but not optional in DF
  • DT[[3]] == DF[3] == DF[[3]]
  • DT[i,] where i is a single integer returns a single row, just like DF[i,], but unlike a matrix single row subset which returns a vector.
  • DT[,j,with=FALSE] where j is a single integer returns a one column data.table, unlike DF[,j] which returns a vector by default
  • DT[,"colA",with=FALSE][[1]] == DF[,"colA"].
  • DT[,colA] == DF[,"colA"]
  • DT[,list(colA)] == DF[,"colA",drop=FALSE]
  • DT[NA] returns 1 row of NA, but DF[NA] returns a copy of DF containing NA throughout. The symbol NA is type logical in R, and is therefore recycled by [.data.frame. The intention was probably DF[NA_integer_]. [.data.table does this automatically for convenience.
  • DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE)] returns NA rows for each NA
  • DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]
  • data.frame(list(1:2,"k",1:4)) creates 3 columns, data.table creates one list column.
  • check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
  • stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency.
  • Since a global string cache was added to R, character items are pointers to the single cached string and there is no longer a performance benefit in converting to factor.
  • Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
  • In [.data.frame we very often set drop=FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and drop drop.
  • When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
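
A few of the differences listed above can be tried out with a sketch like the following. The toy data is hypothetical (the names ColA, ColB and ColC are chosen only to echo the FAQ), and it assumes a reasonably recent data.table release.

```r
library(data.table)

# Hypothetical toy data; ColA, ColB and ColC are illustrative names.
DF <- data.frame(ColA = c(1, NA, 3), ColB = c(1, 2, 4),
                 ColC = c("x", "y", "z"), stringsAsFactors = FALSE)
DT <- as.data.table(DF)

row3 <- DT[3]                # data.table: the 3rd row
col3 <- DF[3]                # data.frame: the 3rd column, as a one-column data.frame

vec <- DT[, ColA]            # a plain vector
one_col <- DT[, list(ColA)]  # a one-column data.table

na_row <- DT[NA]             # a single row of NA

# A logical NA in i is treated as FALSE by data.table,
# but produces an NA row in data.frame.
dt_rows <- DT[c(TRUE, NA, FALSE)]
df_rows <- DF[c(TRUE, NA, FALSE), ]

# The same applies to comparisons that evaluate to NA.
dt_equal <- DT[ColA == ColB]
df_equal <- DF[!is.na(DF$ColA) & !is.na(DF$ColB) & DF$ColA == DF$ColB, ]
```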






Small caveat

There may be cases where a package uses code that fails when given a data.table rather than a plain data.frame. However, since data.table is actively maintained to avoid such problems, any that do arise are likely to be fixed promptly.

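For example, from the data.table release notes:

  • base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.

  • An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.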
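
The ITime-to-POSIXct conversion mentioned in the last item might look like the sketch below. The times and the anchor date are made up for illustration, and it assumes the as.POSIXct method for ITime that data.table documents on its IDateTime help page; no plotting is done here.

```r
library(data.table)

# Hypothetical times of day; the anchor date is arbitrary and only there
# to give each time a full timestamp.
times <- as.ITime(c("09:30:00", "14:15:30"))
stamps <- as.POSIXct(times, tz = "UTC", date = as.Date("2020-01-01"))

# The result is an ordinary POSIXct vector, which date-time axes understand.
clock <- format(stamps, "%H:%M:%S", tz = "UTC")
```

The POSIXct column can then be mapped to an axis in place of the ITime column, so labels show clock times rather than seconds since midnight.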
