R core 的`split` 函数背后的算法是什么? [英] What is the algorithm behind R core's `split` function?

查看:58
本文介绍了R core 的`split` 函数背后的算法是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

split 是 R 核心中一个特别重要的功能.许多提供基于 R 的数据操作解决方案的 Stack Overflow 答案都依赖于它.这是任何分组操作的主力例程.

split is an especially important function in R core. Many Stack Overflow answers offering R-base solutions on data manipulation rely on it. It is the workhorse routine of any group-by operations.

还有很多问题的解决方案只是一行split.很多人不知道

There are also many questions whose solution is just a single line with split. Many people do not know that

  • split.data.frame 可以按行分割矩阵;
  • split.default 可以按列拆分数据框.
  • split.data.frame can split a matrix by row;
  • split.default can split a data frame by column.

也许关于 split 的 R 文档做得不太好.它确实提到了第一个用途,但没有提到第二个.

Perhaps R documentation on split is not doing very well. It does mention the first use, but does not mention the second.

在R核心中有四种split的方法:

There are four methods for split in R core:

methods(split)
#[1] split.data.frame split.Date       split.default    split.POSIXct

我将提供一个答案,深入解释 split.data.framesplit.default 和 C 级 .Internal(split(x,f)) 工作.欢迎在日期"和POSIXct"对象上提供其他答案.

I will provide an answer explaining in depth how split.data.frame, split.default and the C-level .Internal(split(x, f)) work. Other answers are welcomed on "Date" and "POSIXct" object.

推荐答案

split.data.frame 是如何工作的?

How does split.data.frame work?

function (x, f, drop = FALSE, ...) 
lapply(split(x = seq_len(nrow(x)), f = f, drop = drop, ...), 
       function(ind) x[ind, , drop = FALSE])

调用split.default分割行索引向量seq_len(nrow(x)),然后使用lapply循环提取关联的行到列表条目中.

It calls split.default to split row index vector seq_len(nrow(x)), then use an lapply loop to extract associated rows into a list entry.

这不是严格意义上的data.frame"方法.它按第一维分割任何二维对象,包括按行分割矩阵.

This isn't strictly a "data.frame" method. It splits any 2-dimensional objects by the 1st dimension, including splitting a matrix by rows.

function (x, f, drop = FALSE, sep = ".", lex.order = FALSE, ...) 
{
if (!missing(...)) 
    .NotYetUsed(deparse(...), error = FALSE)
if (is.list(f)) 
    f <- interaction(f, drop = drop, sep = sep, lex.order = lex.order)
else if (!is.factor(f)) 
    f <- as.factor(f)
else if (drop) 
    f <- factor(f)
storage.mode(f) <- "integer"
if (is.null(attr(x, "class"))) 
    return(.Internal(split(x, f)))
lf <- levels(f)
y <- vector("list", length(lf))
names(y) <- lf
ind <- .Internal(split(seq_along(x), f))
for (k in lf) y[[k]] <- x[ind[[k]]]
y
}

  1. 如果 x 没有类(即,主要是一个原子向量),则使用 .Internal(split(x, f));
  2. 否则,它使用 .Internal(split()) 沿 x 拆分索引,然后使用 for 循环提取关联的元素到列表条目中.
  1. if x has no classes (i.e., mostly an atomic vector), .Internal(split(x, f)) is used;
  2. otherwise, it uses .Internal(split()) to split the index along x, then uses a for loop to extract associated elements into a list entry.

原子向量(参见?vector)是具有以下模式的向量:

An atomic vector (see ?vector) is a vector with the following mode:

  • 逻辑"、整数"、数字"、复数"、字符"和原始"
  • 列表"
  • 表达"

一个有类的对象......呃......有这么多!!让我举三个例子:

An object with class... Er... there are so many!! Let me just give three examples:

  • 因素"
  • 数据框架"
  • 矩阵"

在我看来 split.default 写得不好.有很多带有类的对象,但是split.default 会通过"[" 以同样的方式处理它们.这适用于因子"和data.frame"(因此我们将沿列拆分数据框!),但它绝对不能以我们期望的方式使用矩阵.

In my opinion the split.default is not well written. There are so many objects with classes, yet split.default would deal with them in the same way via"[". This works fine with "factor" and "data.frame" (so we will be splitting data frame along the columns!), but it definitely does not work with a matrix in a way we expect.

A <- matrix(1:9, 3)
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9

split.default(A, c(1, 1, 2))  ## it does not split the matrix by columns!
#$`1`
#[1] 1 2 4 5 7 8
#
#$`2`
#[1] 3 6 9

实际上回收规则已经应用到c(1, 1, 2)上,我们等价的做了:

Actually recycling rule has been applied to c(1, 1, 2), and we are equivalently doing:

split(c(A), rep_len(c(1,1,2), length(A)))

为什么R核心不为矩阵"写另一行,比如

Why doesn't R core write another line for a "matrix", like

for (k in lf) y[[k]] <- x[, ind[[k]], drop = FALSE]

到目前为止,按列安全地拆分矩阵的唯一方法是转置它,然后是 split.data.frame,然后是另一个转置.

Till now the only way to safely split a matrix by columns is to transpose it, then split.data.frame, then another transpose.

lapply(split.data.frame(t(A), c(1, 1, 2)), t)

通过 lapply(split.default(data.frame(A), c(1, 1, 2)), as.matrix) 的另一种解决方法是错误的,如果 A 是一个字符矩阵.

Another workaround via lapply(split.default(data.frame(A), c(1, 1, 2)), as.matrix) is buggy if A is a character matrix.

这真的是核心的核心!下面我举一个小例子来说明:

This is really the core of the core! I will take a small example below for explanation:

set.seed(0)
f <- sample(factor(letters[1:3]), 10, TRUE)
# [1] c a b b c a c c b b
#Levels: a b c

x <- 0:9

基本上有3个步骤.为了提高可读性,每一步都提供了等效的R代码.

Basically there are 3 steps. To enhance readability, Equivalent R code are provided for each step.

第一步:制表(计算每个因素水平的出现次数)

## a factor has integer mode so `tabulate` works
tab <- tabulate(f, nbins = nlevels(f))
[1] 2 4 4

第 2 步:结果列表的存储分配

result <- vector("list", nlevels(f))
for (i in 1:length(tab)) result[[i]] <- vector(mode(x), tab[i])
names(result) <- levels(f)

我会如下注释这个列表,其中每一行是一个列表元素,在这个例子中是一个向量,每个 [ ] 是该向量条目的占位符.

I would annotate this list as follows, where each line is a list element which is a vector in this example, and each [ ] is a placeholder for an entry of that vector.

$a: [ ] [ ]

$b: [ ] [ ] [ ] [ ]

$c: [ ] [ ] [ ] [ ]

第 3 步:元素分配

现在揭示一个因子的内部整数模式很有用:

Now it is useful to uncover the internal integer mode for a factor:

.f <- as.integer(f)
#[1] 3 1 2 2 3 1 3 3 2 2

我们需要扫描x.f,将x[i]填入右边的入口result[[.f[i]]],由累加器缓冲区向量通知.

We need to scan x and .f, filling x[i] into the right entry of result[[.f[i]]], informed by an accumulator buffer vector.

ab <- integer(nlevels(f))  ## accumulator buffer

for (i in 1:length(.f)) {
  fi <- .f[i] 
  counter <- ab[fi] + 1L
  result[[fi]][counter] <- x[i]
  ab[fi] <- counter
  }

在下图中,^ 是指向被访问或更新的元素的指针.

In the following illustration, ^ is a pointer to elements that are accessed or updated.

## i = 1

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
     ^

ab: [0] [0] [0]  ## on entry
             ^

$a: [ ] [ ]

$b: [ ] [ ] [ ] [ ]

$c: [0] [ ] [ ] [ ]
     ^

ab: [0] [0] [1]  ## on exit
             ^

## i = 2

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
         ^

ab: [0] [0] [1]  ## on entry
     ^

$a: [1] [ ]
     ^
$b: [ ] [ ] [ ] [ ]

$c: [0] [ ] [ ] [ ]

ab: [1] [0] [1]  ## on exit
     ^

## i = 3

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
             ^

ab: [1] [0] [1]  ## on entry
         ^

$a: [1] [ ]

$b: [2] [ ] [ ] [ ]
     ^
$c: [0] [ ] [ ] [ ]

ab: [1] [1] [1]  ## on exit
         ^

## i = 4

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                 ^

ab: [1] [1] [1]  ## on entry
         ^

$a: [1] [ ]

$b: [2] [3] [ ] [ ]
         ^
$c: [0] [ ] [ ] [ ]

ab: [1] [2] [1]  ## on exit
         ^

## i = 5

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                     ^

ab: [1] [2] [1]  ## on entry
             ^

$a: [1] [ ]

$b: [2] [3] [ ] [ ]

$c: [0] [4] [ ] [ ]
         ^

ab: [1] [2] [2]  ## on exit
             ^

## i = 6

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                         ^

ab: [1] [2] [2]  ## on entry
     ^

$a: [1] [5]
         ^
$b: [2] [3] [ ] [ ]

$c: [0] [4] [ ] [ ]

ab: [2] [2] [2]  ## on exit
     ^

## i = 7

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                             ^

ab: [2] [2] [2]  ## on entry
             ^

$a: [1] [5]

$b: [2] [3] [ ] [ ]

$c: [0] [4] [6] [ ]
             ^

ab: [2] [2] [3]  ## on exit
             ^

## i = 8

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                 ^

ab: [2] [2] [3]  ## on entry
             ^

$a: [1] [5]

$b: [2] [3] [ ] [ ]

$c: [0] [4] [6] [7]
                 ^

ab: [2] [2] [4]  ## on exit
             ^

## i = 9

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                     ^

ab: [2] [2] [4]  ## on entry
         ^

$a: [1] [5]

$b: [2] [3] [8] [ ]
             ^
$c: [0] [4] [6] [7]

ab: [2] [3] [4]  ## on exit
         ^

## i = 10

 x: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
.f: [3] [1] [2] [2] [3] [1] [3] [3] [2] [2]
                                         ^

ab: [2] [3] [4]  ## on entry
         ^

$a: [1] [5]

$b: [2] [3] [8] [9]
                 ^
$c: [0] [4] [6] [7]

ab: [2] [4] [4]  ## on exit
         ^

这篇关于R core 的`split` 函数背后的算法是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆