Summary statistics using ddply


Question

I would like to write a function using ddply that outputs summary statistics for two columns of the data.frame mat, given the column names.

  • mat is a large data.frame with columns named "metric", "length", "species", "tree", ..., "index".

  • index is a factor with 2 levels, "Short" and "Long".

  • "metric", "length", "species", "tree" and the others are all continuous variables.

Function:

summary1 <- function(arg1,arg2) {
    ...

    ss <- ddply(mat, .(index), function(X) data.frame(
        arg1 = as.list(summary(X$arg1)),
        arg2 = as.list(summary(X$arg2))),
        .parallel = FALSE)

    ss
}

I expect the output to look like this after calling summary1("metric","length")

Short metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max.
      length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.

....

Long  metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max.
      length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.

....

At the moment the function does not produce the desired output. What should I change?

Thanks for your help.


Here is a toy example

mat <- data.frame(
    metric = rpois(10,10), length = rpois(10,10), species = rpois(10,10),
    tree = rpois(10,10), index = c(rep("Short",5),rep("Long",5))
)

Solution

As Nick wrote in his answer, you can't use $ to reference a variable passed as a character name. When you write X$arg1, R searches for a column literally named "arg1" in the data.frame X. You can reference the column either with X[,arg1] or with X[[arg1]].
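The difference between the two lookups can be checked on a small data frame. This is a minimal sketch in base R; the data frame X and its values are made up for illustration and are not the asker's data.

```r
# Toy data frame; the values are illustrative only.
X <- data.frame(metric = c(5, 7, 9), length = c(2, 4, 6))
arg1 <- "metric"   # column name held in a character variable

# `$` does not evaluate arg1; it looks for a column literally named "arg1".
by_dollar <- X$arg1

# `[[` and `[, ]` evaluate arg1 first, then look up the column "metric".
by_double_bracket <- X[[arg1]]
by_single_bracket <- X[, arg1]

print(is.null(by_dollar))                        # no column named "arg1"
print(identical(by_double_bracket, X$metric))
print(identical(by_single_bracket, X$metric))
```

Because X has no column called "arg1", the `$` lookup yields NULL, which is why summary(X$arg1) inside the original function does not summarise the intended column.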

And if you want nicely named output, I propose the solution below:

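library(plyr)  # provides ddply() and .()

summary1 <- function(arg1, arg2) {

    # arg1 and arg2 are column names given as character strings;
    # mat is not an argument, so it is looked up in the enclosing environment.
    ss <- ddply(mat, .(index), function(X) data.frame(
        setNames(
            # one named list of summary statistics per requested column
            list(as.list(summary(X[[arg1]])), as.list(summary(X[[arg2]]))),
            c(arg1, arg2)
            )), .parallel = FALSE)

    ss
}
summary1("metric", "length")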

Output for the toy data is:

  index metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu.
1  Long           5              7            10         8.6             10
2 Short           7              7             9         8.8             10
  metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu.
1          11           9             10            11        10.8             12
2          11           4              9             9         9.0             11
  length.Max.
1          12
2          12
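To see where column names such as metric.1st.Qu. come from, the inner expression can be run on its own, without ddply. The sketch below uses base R only; the vectors x and y are invented values standing in for one group's columns.

```r
# Illustrative values only; not taken from the toy data above.
x <- c(5, 7, 10, 10, 11)
y <- c(9, 10, 11, 12, 12)
arg1 <- "metric"
arg2 <- "length"

# summary() returns a named vector (Min., 1st Qu., Median, Mean, 3rd Qu., Max.);
# as.list() turns it into a named list with one element per statistic.
stats <- setNames(
    list(as.list(summary(x)), as.list(summary(y))),
    c(arg1, arg2)
)

# data.frame() flattens the nested list into a single row. Each column name is
# the outer name, a dot, and the inner name, made syntactically valid.
row <- data.frame(stats)
print(names(row))
print(dim(row))
```

ddply then stacks one such row per level of index, which gives the wide table shown above.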
