How to apply a function to a subset of data.table using by and exposing all columns to the function?


Question

When slicing a data.table by group(s), the variables used for grouping are not part of the subset (.SD) while the function executes. I demonstrate this using debugonce().

library(data.table)
x <- data.table(a = rep(letters[1:4], each = 3), b = rep(c("a", "b"), each = 6), c = rnorm(12))

myfun <- function(y) paste(y$a, y$b, y$c, collapse = "")

> debugonce(myfun)
> x[, myfun(.SD), by = .(b, a)]
debugging in: myfun(.SD)
debug: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
            c
1: -1.2662416
2:  0.9818497
3: -0.5395385

What I'm after is the functionality of the split-sapply paradigm, where I would slice a data.frame according to factor(s) and apply the function to all columns, including the variables that were used to slice it (demonstrated below).

> debugonce(myfun)

> sapply(split(x, f = list(x$b, x$a)), FUN = myfun)
debugging in: FUN(X[[i]], ...)
debug: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
   a b          c
1: a a -1.2662416
2: a a  0.9818497
3: a a -0.5395385


Recommended answer

The OP has a function that takes a list as its argument. That list should contain all columns of the data.table, including the columns used for grouping in by.

According to help(".SD"):

.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by (or keyby).

(Emphasis mine: the columns used in by are excluded from .SD.)

.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance.

So, .BY and .SD complement each other: together they give access to all columns of the data.table.
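As a quick check of this complementarity, the following minimal sketch lists the column names that .BY and .SD carry for each group. It reuses the sample data from the question; the result column names in_BY and in_SD are made up for illustration.

```r
library(data.table)

# Sample data from the question
x <- data.table(a = rep(letters[1:4], each = 3),
                b = rep(c("a", "b"), each = 6),
                c = rnorm(12))

# For each group, record which columns arrive via .BY and which via .SD
cols <- x[, .(in_BY = paste(names(.BY), collapse = ","),
              in_SD = paste(names(.SD), collapse = ",")),
          by = .(b, a)]
print(cols)
```

In every group, in_BY should hold the grouping columns b and a, while in_SD holds only the remaining column c.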

Instead of explicitly repeating the by columns in the function call

# the list elements must be named, otherwise y$b and y$a are NULL inside myfun()
x[, myfun(c(list(b = b, a = a), .SD)), by = .(b, a)]

we can use

x[, myfun(c(.BY, .SD)), by = .(b, a)]



   b a                                                                 V1
1: a a    a a -1.02091215130492a a -0.295107569536843a a 0.77776326093429
2: a b b a -0.369037832486311b a -0.716211663822323b a -0.264799143319049
3: b c      c b -1.39603530693486c b 1.4707902839894c b 0.721925347069227
4: b d   d b -1.15220308230505d b -0.736782242593426d b 0.420986999145651


The OP used debugonce() to show the argument passed to myfun():

> debugonce(myfun)
> x[, myfun(c(.BY, .SD)), by = .(b, a)]
debugging in: myfun(c(.BY, .SD))
debug at #1: paste(y$a, y$b, y$c, collapse = "")
Browse[2]> y
$b
[1] "a"

$a
[1] "a"

$c
[1] -1.0209122 -0.2951076  0.7777633




Another example

With a different sample data set and function, it may be easier to illustrate the core of the question:

x <- data.table(a = rep(letters[3:6], each = 3), b = rep(c("x", "y"), each = 6), c = 1:12)
myfun <- function(y) paste(y$a, y$b, y$c, sep = "/", collapse = "-")

x[, myfun(.SD), by = .(b, a)]



   b a             V1
1: x c    //1-//2-//3
2: x d    //4-//5-//6
3: y e    //7-//8-//9
4: y f //10-//11-//12


So, columns b and a do appear in the output as grouping variables, but they are not passed to the function via .SD.

Now, complementing .SD with .BY:

x[, myfun(c(.BY, .SD)), by = .(b, a)]



   b a                   V1
1: x c    c/x/1-c/x/2-c/x/3
2: x d    d/x/4-d/x/5-d/x/6
3: y e    e/y/7-e/y/8-e/y/9
4: y f f/y/10-f/y/11-f/y/12


Now all columns of the data.table are passed to the function.
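To tie this back to the split-sapply paradigm from the question, the sketch below compares both approaches on the second sample data set. It assumes base R's split() applied to a plain data.frame copy with drop = TRUE (so that empty b/a combinations are discarded); res_dt and res_split are illustrative names.

```r
library(data.table)

# Second sample data set and function from the answer
x <- data.table(a = rep(letters[3:6], each = 3),
                b = rep(c("x", "y"), each = 6),
                c = 1:12)
myfun <- function(y) paste(y$a, y$b, y$c, sep = "/", collapse = "-")

# data.table approach: grouping columns supplied through .BY
res_dt <- x[, myfun(c(.BY, .SD)), by = .(b, a)]

# split-sapply approach on a data.frame copy; drop = TRUE removes empty groups
res_split <- sapply(split(as.data.frame(x), list(x$b, x$a), drop = TRUE), myfun)

# Both approaches should produce the same strings per group
print(identical(unname(res_split), res_dt$V1))
```

Note that .BY holds length-1 values, so paste() recycles them across the rows of .SD, whereas the split pieces carry full-length grouping columns. For this function the resulting strings coincide.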

Roland has suggested passing .BY and .SD as separate parameters to the function. Indeed, .BY is a list object and .SD is a data.table object (which is essentially also a list, and that is what allowed us to use c(.BY, .SD)). There may be cases where the difference matters.

To verify, we can define a function that prints str() of its argument as a side effect. The function is only called for the first group (.GRP == 1L).

myfun1 <- function(y) str(y)
x[, if (.GRP == 1L) myfun1(.SD), by = .(b, a)]



Classes ‘data.table’ and 'data.frame':    3 obs. of  1 variable:
 $ c: int  1 2 3
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, ".data.table.locked")= logi TRUE
Empty data.table (0 rows) of 2 cols: b,a



x[, if (.GRP == 1L) myfun1(.BY), by = .(b, a)]



List of 2
 $ b: chr "x"
 $ a: chr "c"
Empty data.table (0 rows) of 2 cols: b,a



x[, if (.GRP == 1L) myfun1(c(.BY, .SD)), by = .(b, a)]



List of 3
 $ b: chr "x"
 $ a: chr "c"
 $ c: int [1:3] 1 2 3
Empty data.table (0 rows) of 2 cols: b,a
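Roland's suggestion of passing .BY and .SD separately could look roughly like the sketch below. The two-argument function myfun2 and the result names are hypothetical and are not taken from Roland's comment; the data is the second sample data set from above.

```r
library(data.table)

x <- data.table(a = rep(letters[3:6], each = 3),
                b = rep(c("x", "y"), each = 6),
                c = 1:12)

# Hypothetical variant of myfun(): grouping values and subset arrive separately
myfun2 <- function(by, sd) paste(by$a, by$b, sd$c, sep = "/", collapse = "-")

res <- x[, myfun2(.BY, .SD), by = .(b, a)]
print(res)

# .SD keeps its data.table class, while c(.BY, .SD) is a plain list
cls <- x[, .(sd_is_dt = is.data.table(.SD),
             combined_is_dt = is.data.table(c(.BY, .SD))),
         by = .(b, a)]
print(cls)
```

Keeping the two arguments apart lets the function use data.table methods on the subset, which is one case where the difference between a list and a data.table might matter.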



Additional links

Besides help(".SD"), the comments and answers to the following SO questions might be useful:

  • What does .SD stand for in data.table in R
  • Use of lapply .SD in data.table R
