Assigning group ID with ddply

Question

Pretty basic performance question from an R newbie. I'd like to assign a group ID to each row in a data frame by unique combinations of fields. Here's my current approach:

> library(plyr)  # provides ddply
> # An example data frame
> df <- data.frame(name=c("Anne", "Bob", "Chris", "Dan", "Erin"), 
                   st.num=c("101", "102", "105", "102", "150"), 
                   st.name=c("Main", "Elm", "Park", "Elm", "Main"))
> df
   name st.num st.name
1  Anne    101    Main
2   Bob    102     Elm
3 Chris    105    Park
4   Dan    102     Elm
5  Erin    150    Main
> 
> # A function to generate a random string
> getString <- function(size=10) return(paste(sample(c(0:9, LETTERS, letters), size, replace=TRUE), collapse=''))
>
> # Assign a random string for each unique street number + street name combination
> df <- ddply(df, 
              c("st.num", "st.name"), 
              function(x) transform(x, household=getString()))
> df
   name st.num st.name  household
1  Anne    101    Main 1EZWm4BQel
2   Bob    102     Elm xNaeuo50NS
3   Dan    102     Elm xNaeuo50NS
4 Chris    105    Park Ju1NZfWlva
5  Erin    150    Main G2gKAMZ1cU

While this works well for data frames with relatively few rows or a small number of groups, I run into performance problems with larger data sets ( > 100,000 rows) that have many unique groups.

Any suggestions to improve the speed of this task? Possibly with plyr's experimental idata.frame()? Or am I going about this all wrong?

Thanks in advance for your help.

Recommended answer

Try using the id function (also in plyr):

df$id <- id(df[c("st.num", "st.name")], drop = TRUE)
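If plyr is not available, the same idea can be sketched in base R. This is only an illustration under assumptions, not part of the original answer: the key separator "\r" is assumed never to occur in the data, and the household column is rebuilt from the question's getString helper. The likely reason this is faster than the ddply version is that the grouping is computed in one vectorized pass instead of calling a function once per group.

```r
# Example data from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"),
                 stringsAsFactors = FALSE)

# Build one key per row; "\r" is an assumed separator that does not occur in the data
key <- paste(df$st.num, df$st.name, sep = "\r")

# Integer group ID: position of each key among the unique keys (order of first appearance)
df$id <- match(key, unique(key))

# Optional: one random string per group, as in the question
getString <- function(size = 10) {
  paste(sample(c(0:9, LETTERS, letters), size, replace = TRUE), collapse = "")
}
codes <- vapply(seq_len(max(df$id)), function(i) getString(), character(1))
df$household <- codes[df$id]
```

With this data df$id is 1, 2, 3, 2, 4, so Bob and Dan share an ID. Note that independently drawn random strings could in principle collide between groups; the integer ID cannot.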

Update:

dplyr's own id function has been deprecated since dplyr version 0.5.0; within dplyr, the function group_indices provides the same functionality. (The id function in plyr is a separate function.)
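A minimal sketch of the dplyr route, assuming dplyr is installed and reusing the question's example data. It calls group_indices on an already grouped data frame; in dplyr 1.0.0 and later, cur_group_id() inside mutate() is another way to get the same kind of ID.

```r
library(dplyr)

# Example data from the question
df <- data.frame(name    = c("Anne", "Bob", "Chris", "Dan", "Erin"),
                 st.num  = c("101", "102", "105", "102", "150"),
                 st.name = c("Main", "Elm", "Park", "Elm", "Main"),
                 stringsAsFactors = FALSE)

# One integer per row identifying its (st.num, st.name) group
df$id <- df %>%
  group_by(st.num, st.name) %>%
  group_indices()
```

Rows with the same street number and street name (Bob and Dan here) receive the same integer.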
