Find top deciles from dataframe by group


Problem description

I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.

Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?

#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
    "v1" = rep(c("A", "B"), c(10,20)), 
    "v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)

#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {

    df <- df[df$v1 == cat,]
    df <- df[order(-df$v2),]

    #Divide the customers into top percents
    nr <- nrow(df)
    p10 <- round(nr * .10, 0)
    cat("Number of people in the Top 10% :", p10, "\n")
    p20 <- round(nr * .20, 0)
    p11_20 <- p20-p10
    cat("Number of people in the 11-20% :", p11_20, "\n")

    #Keep only those customers in the top groups
    df <- df[1:p20,]

    #Create a variable to identify the percent group the customer is in
    top_pct <- integer(length = p10 + p11_20)

    #Identify those in each group
    top_pct[1:p10] <- 10
    top_pct[(p10+1):p20] <- 20

    #Add this variable to the data frame
    df$top_pct <- top_pct

    #Keep only custID and the new variable
    df <- subset(df, select = c(custID, top_pct))

    return(df)

}


##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)

#Explore the results
summary(res)

    #      Length Class      Mode
    # [1,] 2      data.frame list
    # [2,] 2      data.frame list

print(res)

    # [[1]]
    #   custID top_pct
    # 1      1      10
    # 2      2      20
    # 
    # [[2]]
    #    custID top_pct
    # 11      1      10
    # 16      6      10
    # 12      2      20
    # 17      7      20



##Merge the two data frames but with top_pct as a different variable for each category

#Change the new variable name
for(i in 1:length(res)) {
    names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}

#Merge the results
res_m <- res[[1]]
for(i in 2:length(res)) {
    res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}

print(res_m)

    #   custID A_top_pct B_top_pct
    # 1      1        10        10
    # 2      2        20        20
    # 3      6        NA        10
    # 4      7        NA        20

Solution

The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.

lapply(split(data, data$v1), function(df) {
    cutoff <- quantile(df$v2, c(0.8, 0.9))
    top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
    na.omit(data.frame(id=df$custID, top_pct))
})
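
The question also asks how to replace the renaming and merging loops. One possible sketch, not part of the original answer, uses Map for the renaming and Reduce for the merging. It assumes the key column is kept as custID (the answer above names it id) so the result matches the question's res_m.

```r
# Reproducible data from the question
data <- data.frame(custID = c(1:10, 1:20),
                   v1 = rep(c("A", "B"), c(10, 20)),
                   v2 = c(30:21, 20:19, 1:3, 20:6),
                   stringsAsFactors = TRUE)

# Split/lapply step from the answer; the key column is named custID here
# (an assumption made so the merge below lines up with the question)
res <- lapply(split(data, data$v1), function(df) {
    cutoff <- quantile(df$v2, c(0.8, 0.9))
    top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
    na.omit(data.frame(custID = df$custID, top_pct))
})

# Rename the second column of each piece after its group
# (stands in for the first for loop)
res <- Map(function(df, nm) {
    names(df)[2] <- paste0(nm, "_top_pct")
    df
}, res, names(res))

# Fold the list into one data frame with successive full outer joins
# (stands in for the second for loop)
res_m <- Reduce(function(x, y) merge(x, y, by = "custID", all = TRUE), res)
print(res_m)
```

Because split names the list elements after the levels of v1, names(res) can supply the column prefixes directly and v1Levels is no longer needed.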

Finding quantiles is done with quantile.
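
As a small illustration of what the cutoffs look like, the sketch below applies quantile to the v2 values of group A from the question's data. The variable name v2_A is only for this example. Note that this quantile-based cut is not the same rule as the question's rank-based round(nr * .10): the two agree on the example data, but they may flag different numbers of customers when several values tie at a cutoff.

```r
# v2 values for group A in the question's example data
v2_A <- 30:21

# Default quantile() (type 7) interpolates between order statistics
cutoff <- quantile(v2_A, c(0.8, 0.9))
print(cutoff)
# 80% -> 28.2, 90% -> 29.1

# Only values strictly above a cutoff are flagged
top10 <- v2_A[v2_A > cutoff[2]]                         # 30
next10 <- v2_A[v2_A > cutoff[1] & v2_A <= cutoff[2]]    # 29
```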
