What can you do with a data.frame that you can't do with a data.table?

Question

I just started using R, and came across data.table. I found it brilliant.

A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?

Answer

From the data.table FAQ

FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?

As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Changing even something as simple as DF[,1] would break existing code in many packages and in user code. The difference is by design: j has to work this way for the more complicated syntax to work. There are other differences, too (see FAQ 2.17).

Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table.
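
The inheritance described above can be checked directly. This is a minimal sketch, assuming the data.table package is installed; the column names and values are made up for illustration, and lm() is used only as one example of a base function written for data.frame input.

```r
library(data.table)

# Illustrative data; the names x and y are arbitrary.
DT <- data.table(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# A data.table carries both classes, so data.frame checks succeed.
dt_class <- class(DT)
is_df    <- is.data.frame(DT)
inherits_df <- inherits(DT, "data.frame")

# A base R function that expects a data.frame accepts the data.table as-is.
fit <- lm(y ~ x, data = DT)
print(dt_class)
print(coef(fit))
```

Because the class vector lists "data.table" first, data.table-aware code gets the data.table methods, while code that only checks for a data.frame still accepts the object.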

We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:

unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.

A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.

What are the smaller syntax differences between data.frame and data.table?

  • DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
  • DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
  • For this reason we say the comma is optional in DT, but not optional in DF
  • DT[[3]] == DF[, 3] == DF[[3]]
  • DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
  • DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
  • DT[ , "colA"][[1]] == DF[ , "colA"].
  • DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
  • DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
  • DT[NA] returns 1 row of NA, but DF[NA, ] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_, ]. [.data.table diverts to this probable intention automatically, for convenience.
  • DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE), ] returns NA rows for each NA
  • DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
  • data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
  • check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
  • stringsAsFactors was by default TRUE in data.frame before R 4.0.0 (the default is FALSE from R 4.0.0 onward) but is FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are pointers to the single cached string and there is no longer a performance benefit of converting to factor.
  • Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.

In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.

When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
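
The row and column subsetting differences in the list above can be tried side by side. This is only a sketch, assuming data.table v1.9.8 or later (older versions treated DT[, "a"] differently); the column names a, b and c are made up for illustration.

```r
library(data.table)

# Illustrative data with made-up column names.
DF <- data.frame(a = 1:4, b = 5:8, c = c(0.5, 1.5, 2.5, 3.5))
DT <- as.data.table(DF)

# A single index without a comma: a row in DT, a column in DF.
row3 <- DT[3]   # one-row data.table with all three columns
col3 <- DF[3]   # one-column data.frame with all four rows

# Double brackets extract the column as a plain vector in both.
same_vector <- identical(DT[[3]], DF[[3]]) && identical(DF[, 3], DF[[3]])

# Selecting one column by name: DT keeps a data.table, DF drops to a vector.
one_col_dt <- DT[, "a"]
vec_df     <- DF[, "a"]

# Equivalent spellings across the two classes.
same_by_name <- identical(DT[, "a"][[1]], DF[, "a"])
same_by_symbol <- identical(DT[, a], DF[, "a"])
kept_df <- DF[, "a", drop = FALSE]   # data.frame counterpart of DT[, list(a)]
kept_dt <- DT[, list(a)]
```

The practical upshot is that code written for data.table never needs drop = FALSE to keep a one-column result tabular.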

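The NA handling in i can be sketched the same way. The data below is invented for illustration, and the data.frame calls use the row form with a trailing comma (DF[i, ]), since that is the form that subsets rows in a data.frame.

```r
library(data.table)

# Illustrative data; ColA and ColB mirror the names used in the list above.
DF <- data.frame(ColA = c(1, NA, 3), ColB = c(1, 2, NA))
DT <- as.data.table(DF)

# A bare logical NA: data.table reads it as NA_integer_, data.frame recycles it.
na_dt <- DT[NA]      # one row of NA
na_df <- DF[NA, ]    # one NA row per row of DF

# NA inside a logical index: dropped by data.table, kept as an NA row by data.frame.
lgl_dt <- DT[c(TRUE, NA, FALSE)]
lgl_df <- DF[c(TRUE, NA, FALSE), ]

# Comparisons that evaluate to NA are filtered out by data.table automatically.
eq_dt <- DT[ColA == ColB]
eq_df <- DF[!is.na(DF$ColA) & !is.na(DF$ColB) & DF$ColA == DF$ColB, ]
```

Comparing nrow() of each pair shows the difference: the data.table results never contain rows that exist only because the index was NA.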

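The constructor differences (list input and check.names) can be sketched as follows. The column name "my col" is a made-up example of a name that is not syntactically valid in R.

```r
library(data.table)

# An unnamed list: data.frame spreads it into columns, data.table keeps it as one list column.
df_list <- data.frame(list(1:2, "k", 1:4))
dt_list <- data.table(list(1:2, "k", 1:4))
df_cols <- ncol(df_list)
dt_cols <- ncol(dt_list)

# check.names: data.frame repairs the name by default, data.table leaves it alone.
df_names <- names(data.frame(`my col` = 1))
dt_names <- names(data.table(`my col` = 1))

print(c(df_cols, dt_cols))
print(c(df_names, dt_names))
```

stringsAsFactors is left out of this sketch because its data.frame default depends on the R version in use.
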
Small caveat

There may be cases where a package's code fails when it is given a data.table instead of a plain data.frame. However, since data.table is constantly maintained to avoid such problems, any that do arise should be fixed promptly.

For example

  • base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
  • An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.
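
The ITime-to-POSIXct conversion mentioned in the last item could look roughly like this. It is a sketch rather than the package's recommended recipe: it relies only on ITime storing integer seconds since midnight, and the anchor date 1970-01-01 and the UTC time zone are arbitrary assumptions chosen so that only the time of day matters.

```r
library(data.table)

# Two illustrative times of day.
it <- as.ITime(c("00:00:30", "12:30:00"))

# ITime holds integer seconds since midnight.
secs <- as.integer(it)

# Attach the seconds to an arbitrary date (assumption: 1970-01-01, UTC).
px <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

labels <- format(px, "%H:%M:%S")
print(labels)
```

A POSIXct column built this way can then be mapped to a plot axis, where it is labelled as clock time rather than as raw seconds.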
