What can you do with a data.frame that you can't with a data.table?


Question

I just started using R, and came across data.table. I found it brilliant.

A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?

Recommended answer

From the data.table FAQ:

As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).
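
To make the point about j concrete, here is a minimal sketch. The tables DF and DT and their columns x and y are made up for illustration, and it assumes the data.table package is installed. It shows that j in [.data.table is an expression evaluated inside the table, while j in [.data.frame only selects columns.

```r
library(data.table)

# Illustrative data only; the column names are hypothetical.
DF <- data.frame(x = 1:3, y = c(10, 20, 30))
DT <- as.data.table(DF)

# In [.data.table, j is an expression evaluated within the table,
# so column names can be used like variables.
total_dt <- DT[, sum(y)]          # 60

# In [.data.frame, j selects columns by position or name,
# so the summary has to be applied to the extracted column.
total_df <- sum(DF[, "y"])        # 60

# j can also return several computed values at once as a new data.table.
summary_dt <- DT[, list(n = length(x), mean_y = mean(y))]
print(summary_dt)
```

This expression-based j is what the FAQ means by "more complicated syntax": grouping and computed columns build on the same mechanism.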

Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table.
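
The inheritance can be checked directly. The sketch below assumes data.table is installed; col_means is a hypothetical helper standing in for code written with only data.frame in mind. It does not demonstrate the namespace-based dispatch to [.data.frame that the FAQ describes for data.table-unaware packages, only that a data.table passes an is.data.frame check.

```r
library(data.table)

# Illustrative table; the column names are made up.
DT <- data.table(id = 1:3, value = c(2.5, 3.5, 4.5))

print(class(DT))          # "data.table" "data.frame"
print(is.data.frame(DT))  # TRUE

# Hypothetical function that only knows about data.frame.
col_means <- function(df) {
  stopifnot(is.data.frame(df))
  sapply(df, mean)        # a data.frame is a list of columns
}

means <- col_means(DT)    # accepted, because DT is also a data.frame
print(means)
```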

We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:


unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.

A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html



What are the smaller syntax differences between data.frame and data.table?

  • DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
  • DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
  • For this reason we say the comma is optional in DT, but not optional in DF
  • DT[[3]] == DF[, 3] == DF[[3]]
  • DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
  • DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
  • DT[ , "colA"][[1]] == DF[ , "colA"].
  • DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
  • DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
  • DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
  • DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA
  • DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
  • data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
  • check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
  • stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
  • Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.

In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.

When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
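
A few of the differences listed above can be tried out with a small sketch. The table and its columns colA, colB and colC are invented for illustration, and the code assumes data.table 1.9.8 or later. The data.frame side of the NA case is written in row form, DF[NA, ], which is an assumption about what the FAQ item intends.

```r
library(data.table)

# Illustrative data; colB contains one NA on purpose.
DF <- data.frame(colA = c(1L, 2L, 3L, 4L),
                 colB = c(1L, NA, 3L, 5L),
                 colC = c(0.5, 1.5, 2.5, 3.5))
DT <- as.data.table(DF)

dt_row3 <- DT[3]        # third row: a one-row data.table
df_col3 <- DF[3]        # third column: a one-column data.frame

dt_col2 <- DT[, 2]      # stays a one-column data.table
df_col2 <- DF[, 2]      # dropped to a plain vector by default

dt_vec <- DT[, colA]    # unquoted column name in j gives the vector
df_vec <- DF[, "colA"]

dt_na <- DT[NA]         # a single row of NA
df_na <- DF[NA, ]       # logical NA is recycled: one NA row per row of DF

# NA in a logical i is treated as FALSE by data.table,
# but yields an NA row in data.frame.
dt_eq <- DT[colA == colB]
df_eq <- DF[DF$colA == DF$colB, ]

print(nrow(dt_eq))      # 2
print(nrow(df_eq))      # 3
```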






Small caveat

There may be cases where a package uses code that falls down when given a data.table. However, data.table is constantly maintained to avoid such problems, so any that arise should be fixed promptly.

For example, from the NEWS for v 1.8.2:



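  • base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.

  • An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.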
