Find top deciles from dataframe by group

Views: 130 | Posted: 2017/3/26 0:53:48 | Tags: r dataframe rank quantile split-apply-combine


Problem description


I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.

Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?

#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
    "v1" = rep(c("A", "B"), c(10,20)), 
    "v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)

#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {

    df <- df[df$v1 == cat,]
    df <- df[order(-df$v2),]

    #Divide the customers into top percents
    nr <- nrow(df)
    p10 <- round(nr * .10, 0)
    cat("Number of people in the Top 10% :", p10, "\n")
    p20 <- round(nr * .20, 0)
    p11_20 <- p20-p10
    cat("Number of people in the 11-20% :", p11_20, "\n")

    #Keep only those customers in the top groups
    df <- df[1:p20,]

    #Create a variable to identify the percent group the customer is in
    top_pct <- integer(length = p10 + p11_20)

    #Identify those in each group
    top_pct[1:p10] <- 10
    top_pct[(p10+1):p20] <- 20

    #Add this variable to the data frame
    df$top_pct <- top_pct

    #Keep only custID and the new variable
    df <- subset(df, select = c(custID, top_pct))

    return(df)

}


##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)

#Explore the results
summary(res)

    #      Length Class      Mode
    # [1,] 2      data.frame list
    # [2,] 2      data.frame list

print(res)

    # [[1]]
    #   custID top_pct
    # 1      1      10
    # 2      2      20
    # 
    # [[2]]
    #    custID top_pct
    # 11      1      10
    # 16      6      10
    # 12      2      20
    # 17      7      20



##Merge the two data frames but with top_pct as a different variable for each category

#Change the new variable name
for(i in 1:length(res)) {
    names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}

#Merge the results
res_m <- res[[1]]
for(i in 2:length(res)) {
    res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}

print(res_m)

    #   custID A_top_pct B_top_pct
    # 1      1        10        10
    # 2      2        20        20
    # 3      6        NA        10
    # 4      7        NA        20

Solution

The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.

lapply(split(data, data$v1), function(df) {
    cutoff <- quantile(df$v2, c(0.8, 0.9))
    top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
    na.omit(data.frame(id=df$custID, top_pct))
})

Finding quantiles is done with quantile.
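
The question also asks how to replace the renaming and merging loops. One possible sketch, building on the split/lapply approach above, uses Map for the renaming and Reduce for the merging. The helper name top_deciles is made up for illustration, and the column is kept as custID (rather than id) on the assumption that the pieces should be merged on that key as in the question.

```r
# Reproducible data from the question
data <- data.frame(custID = c(1:10, 1:20),
                   v1 = rep(c("A", "B"), c(10, 20)),
                   v2 = c(30:21, 20:19, 1:3, 20:6),
                   stringsAsFactors = TRUE)

# Hypothetical helper: label the top two deciles within one group.
# Same logic as the answer above, but the key column keeps the name custID.
top_deciles <- function(df) {
    cutoff <- quantile(df$v2, c(0.8, 0.9))
    top_pct <- ifelse(df$v2 > cutoff[2], 10,
                      ifelse(df$v2 > cutoff[1], 20, NA))
    na.omit(data.frame(custID = df$custID, top_pct))
}

# One data frame per level of v1; the list is named after the levels
res <- lapply(split(data, data$v1), top_deciles)

# Rename the second column of each piece after its group
# (stands in for the first for loop in the question)
res <- Map(function(df, level) {
    names(df)[2] <- paste0(level, "_top_pct")
    df
}, res, names(res))

# Fold merge() over the list of pieces
# (stands in for the second for loop in the question)
res_m <- Reduce(function(x, y) merge(x, y, by = "custID", all = TRUE), res)
print(res_m)
```

With the sample data this should give the same four customers (custID 1, 2, 6 and 7) and the same A_top_pct and B_top_pct columns as the merged table in the question. Note that quantile interpolates between observations by default, so on other data the cutoffs can assign ties and group sizes differently from the rounding-based counts in the original function.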
