Unexpected output from aggregate

Problem description

While experimenting with aggregate for another question here, I encountered a rather strange result. I can't figure out why it happens, and I'm wondering whether what I'm doing is totally wrong.

Suppose I have a data.frame like this:

df <- structure(list(V1 = c(1L, 2L, 1L, 2L, 3L, 1L), 
                     V2 = c(2L, 3L, 2L, 3L, 4L, 2L), 
                     V3 = c(3L, 4L, 3L, 4L, 5L, 3L), 
                     V4 = c(4L, 5L, 4L, 5L, 6L, 4L)), 
                  .Names = c("V1", "V2", "V3", "V4"), 
        row.names = c(NA, -6L), class = "data.frame")
> df
#   V1 V2 V3 V4
# 1  1  2  3  4
# 2  2  3  4  5
# 3  1  2  3  4
# 4  2  3  4  5
# 5  3  4  5  6
# 6  1  2  3  4

Now, suppose I want to output a data.frame of the unique rows, with an additional column indicating each row's frequency in df. For this example, that would be:

#   V1 V2 V3 V4 x
# 1  1  2  3  4 3
# 2  2  3  4  5 2
# 3  3  4  5  6 1

I obtained this output with aggregate by experimenting as follows:

> aggregate(do.call(paste, df), by=df, print)

# [1] "1 2 3 4" "1 2 3 4" "1 2 3 4"
# [1] "2 3 4 5" "2 3 4 5"
# [1] "3 4 5 6"
#   V1 V2 V3 V4                         x
# 1  1  2  3  4 1 2 3 4, 1 2 3 4, 1 2 3 4
# 2  2  3  4  5          2 3 4 5, 2 3 4 5
# 3  3  4  5  6                   3 4 5 6

So this gave me the pasted strings for each group. If I use length instead of print, it should give me the number of such occurrences, which is the desired result, and that is indeed the case (as shown below).

> aggregate(do.call(paste, df), by=df, length)
#   V1 V2 V3 V4 x
# 1  1  2  3  4 3
# 2  2  3  4  5 2
# 3  3  4  5  6 1
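To see why counting with length works here, it may help to look at the pasted key on its own. The sketch below rebuilds the small df from above and uses table() as a stand-in counter; the names key and counts are introduced for illustration only.

```r
# Rebuild the small example data.frame from the question
df <- data.frame(V1 = c(1L, 2L, 1L, 2L, 3L, 1L),
                 V2 = c(2L, 3L, 2L, 3L, 4L, 2L),
                 V3 = c(3L, 4L, 3L, 4L, 5L, 3L),
                 V4 = c(4L, 5L, 4L, 5L, 6L, 4L))

# paste() receives the columns as its arguments, so each row
# collapses into one space-separated string
key <- do.call(paste, df)
print(key)

# Counting identical strings gives the frequency of each unique row
counts <- table(key)
print(counts)
```

Each distinct string stands for one unique row, so the three counts (3, 2 and 1) match the x column in the aggregate output.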

And this seemed to work. However, when the data.frame has dimensions 4*2500, the output data.frame is 1*2501 instead of 4*2501 (all rows are unique, so each frequency is 1).

> df <- as.data.frame(matrix(sample(1:3, 1e4, replace = TRUE), nrow=4))
> o <- aggregate(do.call(paste, df), by=df, length)
> dim(o)
# [1]    1 2501

I tested with smaller data.frames that contain only unique rows, and they give the right output (change to nrow=40, for example). However, when the dimensions of the matrix increase, this no longer seems to work. And I just can't figure out what's going wrong! Any ideas?

Recommended answer

The issue here is how aggregate.data.frame() determines the groups.

In aggregate.data.frame() there is a loop that builds the grouping variable grp. In that loop, grp is updated via:

grp <- grp * nlevels(ind) + (as.integer(ind) - 1L)
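The update treats each grouping column as one digit of a mixed-radix number, so every distinct combination of levels gets its own code. A minimal sketch on the small six-row df from the question, assuming grp starts as a numeric vector of zeros (that starting value is an assumption here, not something quoted from the source of aggregate.data.frame()):

```r
# Small example data.frame from the question
df <- data.frame(V1 = c(1L, 2L, 1L, 2L, 3L, 1L),
                 V2 = c(2L, 3L, 2L, 3L, 4L, 2L),
                 V3 = c(3L, 4L, 3L, 4L, 5L, 3L),
                 V4 = c(4L, 5L, 4L, 5L, 6L, 4L))

# Assumed starting point: one numeric zero per row
grp <- rep.int(0, nrow(df))

# Replay the update once per grouping column
for (col in df) {
  ind <- as.factor(col)
  grp <- grp * nlevels(ind) + (as.integer(ind) - 1L)
}

print(grp)
# [1]  0 40  0 40 80  0
```

Rows 1, 3 and 6 share the code 0, rows 2 and 4 share 40, and row 5 gets 80, so identical rows end up with identical codes. Each step multiplies the running code by the number of levels, which is why the codes grow quickly as more columns are added.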

The problem with your example is that, once by has been converted to factors and the loop has gone over all of them, grp ends up being:

Browse[2]> grp
[1] Inf Inf Inf Inf

Essentially, the looping update pushed the values of grp to a number indistinguishable from Inf.
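The overflow can be reproduced outside of aggregate by replaying the same update on a 4 x 2500 data.frame. This is only a sketch: the seed is arbitrary, and the zero-initialised numeric grp is an assumption about the starting value.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- as.data.frame(matrix(sample(1:3, 1e4, replace = TRUE), nrow = 4))

# Assumed starting point: one numeric zero per row
grp <- rep.int(0, nrow(df))

# 2500 grouping columns, each multiplying the running code
# by its number of levels (1, 2 or 3)
for (col in df) {
  ind <- as.factor(col)
  grp <- grp * nlevels(ind) + (as.integer(ind) - 1L)
}

print(grp)                     # the codes have overflowed to Inf
print(all(is.infinite(grp)))
print(length(unique(grp)))     # number of distinct codes left
```

With up to three levels per column and 2500 columns, the codes can grow towards 3^2500, which is far beyond the largest finite double (.Machine$double.xmax, about 1.8e308), so the four rows can no longer be told apart.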

Having done that, aggregate.data.frame() later does

y <- y[match(sort(unique(grp)), grp, 0L), , drop = FALSE]

and this is where the earlier problem now manifests itself as

dim(y[match(sort(unique(grp)), grp, 0L), , drop = FALSE])

because

match(sort(unique(grp)), grp, 0L)

obviously only returns 1:

> match(sort(unique(grp)), grp, 0L)
[1] 1

because grp has only one unique value.
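A small comparison may make the row selection clearer. The two grp vectors below are hypothetical stand-ins: one with distinct finite codes, as a small example would produce, and one where every code is Inf.

```r
# Distinct group codes: one row index is kept per group
grp_ok <- c(0, 40, 0, 40, 80, 0)
idx_ok <- match(sort(unique(grp_ok)), grp_ok, 0L)
print(idx_ok)
# [1] 1 2 5

# All codes overflowed to Inf: only a single "group" is left
grp_inf <- rep(Inf, 4)
idx_inf <- match(sort(unique(grp_inf)), grp_inf, 0L)
print(idx_inf)
# [1] 1
```

Since y is subset by these indices, a single index means a single row in the result, which is the 1 x 2501 output seen in the question.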
