How to ddply() without sorting?


Question

I use the following code to summarize my data, grouped by Compound, Replicate and Mass.

summaryDataFrame <- ddply(reviewDataFrame, .(Compound, Replicate, Mass), 
  .fun = calculate_T60_Over_T0_Ratio)

An unfortunate side effect is that the resulting data frame is sorted by those fields. I would like to do this and keep Compound, Replicate and Mass in the same order as in the original data frame. Any ideas? I tried adding a "Sorting" column of sequential integers to the original data, but of course I can't include that in the .variables since I don't want to 'group by' that, and so it is not returned in the summaryDataFrame.

Thanks for your help.

Recommended answer

This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:

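#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
  col <- ".sortColumn"
  # Tag every row with its original position (seq_len is safe for 0 rows)
  data[,col] <- seq_len(nrow(data))
  out <- fn(data, ...)
  # Only works if fn passes the tag column through to its output
  if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
  # Restore the original row order, then drop the tag column
  out <- out[order(out[,col]),]
  out[,col] <- NULL
  out
}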

#Some sample data 
d <- structure(list(g = c(2L, 2L, 1L, 1L, 2L, 2L), v = c(-1.90127112738315, 
-1.20862680183042, -1.13913266070505, 0.14899803094742, -0.69427656843677, 
0.872558638137971)), .Names = c("g", "v"), row.names = c(NA, 
-6L), class = "data.frame") 

library(plyr)

#This call re-sorts the rows by g
ddply(d, .(g), mutate, v=scale(v)) #does not preserve order of d

#This call keeps the original row order
keeping.order(d, ddply, .(g), mutate, v=scale(v)) #preserves order of d

Please do read the thread for Hadley's notes about why this functionality may not be general enough to roll into ddply, particularly as it probably applies in your case as you are likely returning fewer rows with each piece.
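For the summarizing case in the question, where each group collapses to a single row and keeping.order cannot carry a per-row sort column through, one possible workaround is to compute the minimum of a row-index column inside the summary itself and sort on that afterwards. The following is only a minimal sketch of that idea; the data frame d3 and the column names row_id, first_row and total are hypothetical and do not come from the original thread.

```r
library(plyr)

# Hypothetical data: groups first appear in the order (z,q), (a,q), (z,p), (m,q)
d3 <- data.frame(g1 = c("z", "a", "z", "m", "a"),
                 g2 = c("q", "q", "p", "q", "q"),
                 v  = 1:5,
                 stringsAsFactors = FALSE)

# Record each row's original position
d3$row_id <- seq_len(nrow(d3))

# Summarise per group, keeping the position where each group first appears
res <- ddply(d3, .(g1, g2), summarise,
             first_row = min(row_id),
             total     = sum(v))

# Reorder by first appearance, then drop the helper column
res <- res[order(res$first_row), ]
res$first_row <- NULL
rownames(res) <- NULL
res
```

Because the ordering information is produced by the summary rather than used as a grouping variable, it survives into the output without changing how the data are split.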

Edited to include a strategy for more general cases

If ddply is outputting something sorted in an order you do not like, you basically have two options: specify the desired ordering on the splitting variables beforehand using ordered factors, or manually sort the output after the fact.

For example, consider the following data:

d <- data.frame(x1 = rep(letters[1:3],each = 5), 
                x2 = rep(letters[4:6],5),
                x3 = 1:15,stringsAsFactors = FALSE)

This uses strings for now. ddply will sort the output, which in this case means the default lexical ordering, regardless of the order of the input rows:

> ddply(d,.(x1,x2),summarise, val = sum(x3))
  x1 x2 val
1  a  d   5
2  a  e   7
3  a  f   3
4  b  d  17
5  b  e   8
6  b  f  15
7  c  d  13
8  c  e  25
9  c  f  27


> ddply(d[sample(1:15,15),],.(x1,x2),summarise, val = sum(x3))
  x1 x2 val
1  a  d   5
2  a  e   7
3  a  f   3
4  b  d  17
5  b  e   8
6  b  f  15
7  c  d  13
8  c  e  25
9  c  f  27

If the resulting data frame isn't ending up in the "right" order, it's probably because you really want some of those variables to be ordered factors. Suppose that we really wanted x1 and x2 ordered like so:

d$x1 <- factor(d$x1, levels = c('b','a','c'),ordered = TRUE)
d$x2 <- factor(d$x2, levels = c('d','f','e'), ordered = TRUE)

Now when we use ddply, the resulting sort will be as we intend:

> ddply(d,.(x1,x2),summarise, val = sum(x3))
  x1 x2 val
1  b  d  17
2  b  f  15
3  b  e   8
4  a  d   5
5  a  f   3
6  a  e   7
7  c  d  13
8  c  f  27
9  c  e  25
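When the intended order is simply the order in which each value first appears in the data, the levels do not have to be typed out by hand: unique() returns values in order of first appearance, so it can supply them. A small sketch, assuming a hypothetical data frame d2 that is not part of the original answer:

```r
library(plyr)

# Hypothetical data whose groups first appear in the order b, a, c
d2 <- data.frame(x1 = c("b", "b", "a", "c", "a", "c"),
                 x3 = 1:6,
                 stringsAsFactors = FALSE)

# Levels follow order of first appearance instead of alphabetical order
d2$x1 <- factor(d2$x1, levels = unique(d2$x1))

res <- ddply(d2, .(x1), summarise, val = sum(x3))
res  # rows come out in the order b, a, c
```

With several splitting variables the output is ordered by the levels of the first variable, then the second, and so on, which is not necessarily the first-appearance order of each combination.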

The moral of the story here is that if ddply is outputting something in an order you didn't intend, it's a good sign that you should be using ordered factors for the variables you're splitting on.
