Assigning group ID with ddply
Question
Pretty basic performance question from an R newbie. I'd like to assign a group ID to each row in a data frame, based on unique combinations of fields. Here's my current approach:
> library(plyr)
> # An example data frame
> df <- data.frame(name=c("Anne", "Bob", "Chris", "Dan", "Erin"),
+                  st.num=c("101", "102", "105", "102", "150"),
+                  st.name=c("Main", "Elm", "Park", "Elm", "Main"))
> df
   name st.num st.name
1  Anne    101    Main
2   Bob    102     Elm
3 Chris    105    Park
4   Dan    102     Elm
5  Erin    150    Main
>
> # A function to generate a random string
> getString <- function(size=10) return(paste(sample(c(0:9, LETTERS, letters), size, replace=TRUE), collapse=''))
>
> # Assign a random string for each unique street number + street name combination
> df <- ddply(df,
+             c("st.num", "st.name"),
+             function(x) transform(x, household=getString()))
> df
   name st.num st.name  household
1  Anne    101    Main 1EZWm4BQel
2   Bob    102     Elm xNaeuo50NS
3   Dan    102     Elm xNaeuo50NS
4 Chris    105    Park Ju1NZfWlva
5  Erin    150    Main G2gKAMZ1cU
While this works well for data frames with relatively few rows or a small number of groups, I run into performance problems with larger data sets (> 100,000 rows) that have many unique groups.
Any suggestions to improve the speed of this task? Possibly with plyr's experimental idata.frame()? Or am I going about this all wrong?
Thanks in advance for your help.
Recommended answer
Try using the id function (also in plyr):
df$id <- id(df[c("st.num", "st.name")], drop = TRUE)
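If you'd rather not depend on plyr, a similar result seems achievable in base R. Here is a minimal sketch, assuming the key columns never contain the separator character used below. The column name household is carried over from the question, and key is a hypothetical helper variable:

```r
# Example data frame from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"),
                 stringsAsFactors = FALSE)

# Build one key per row; "\r" is an assumed separator that must not occur in the data
key <- paste(df$st.num, df$st.name, sep = "\r")

# Integer ID = position of each key among the unique keys (first-appearance order)
df$household <- match(key, unique(key))

df$household
# [1] 1 2 3 2 4
```

Both paste and match are vectorized, so no function is called once per group, which is likely what slows down the ddply version when there are many groups. The IDs here are integers rather than random strings; if only group membership matters, that should be enough.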
Update:
In dplyr, the id function has been deprecated since version 0.5.0.
The function group_indices provides the same functionality.
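For reference, a rough sketch of how group_indices could be applied to the example data, assuming dplyr is installed. Grouping first with group_by and then asking for the indices is one way to call it; the household column name is again taken from the question:

```r
library(dplyr)  # assumption: dplyr is installed

# Example data frame from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"),
                 stringsAsFactors = FALSE)

# Group by the two key columns, then fetch each row's integer group index
df$household <- group_indices(group_by(df, st.num, st.name))

df
```

Rows that share a street number and street name (Bob and Dan here) get the same integer. Newer dplyr releases also offer cur_group_id() for use inside mutate() on grouped data, which may be the preferred spelling depending on your dplyr version.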