Using frequency of column value in dataframe to calculate new column value


Question

I have an example dataframe that holds the columns id, count and usernames; id and count are numbers and usernames is a string (stored as a factor in the dput below).

For every row of the dataframe I want to set the value of a new column called 'ratio', defined as


count / number of rows where usernames == the username in this row

Example from the provided data:
In every row where the username is 'Tom', the ratio would be count/4, because the user Tom appears four times in the data.

This is just a simplified version of my problem. A for-loop is not an option because my original dataframe has about 3.4 million rows, and my previous approach, which used for-loops to iterate over the unique values of e.g. 'usernames', takes forever.

The dput of my dataframe:

structure(list(id = 1:20, count = c(140L, 89L, 17L, 114L, 129L, 
86L, 21L, 50L, 197L, 160L, 8L, 14L, 78L, 208L, 155L, 55L, 63L, 
20L, 189L, 79L), usernames = structure(c(4L, 3L, 5L, 5L, 2L, 
3L, 1L, 1L, 3L, 1L, 3L, 2L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 3L), .Label = c("Jerry", 
"Mark", "Phil", "Tina", "Tom"), class = "factor")), .Names = c("id", 
"count", "usernames"), row.names = c(NA, 20L), class = "data.frame")

I hope I provided everything needed to understand and reproduce the problem; if something is missing, don't hesitate to mention it in the comments.
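For anyone reproducing the question, the following is a minimal sketch that rebuilds the same data by hand (equivalent to the dput output above) and tabulates how often each username occurs. The name `mydf` is taken from the answer below; `freq` is only an illustrative name.

```r
# Rebuild the example data; `mydf` is the name the answer assumes.
mydf <- data.frame(
  id = 1:20,
  count = c(140L, 89L, 17L, 114L, 129L, 86L, 21L, 50L, 197L, 160L,
            8L, 14L, 78L, 208L, 155L, 55L, 63L, 20L, 189L, 79L),
  usernames = factor(
    c(4L, 3L, 5L, 5L, 2L, 3L, 1L, 1L, 3L, 1L,
      3L, 2L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 3L),
    levels = 1:5,
    labels = c("Jerry", "Mark", "Phil", "Tina", "Tom")
  )
)

# Frequency of each username; this is the denominator of the ratio.
freq <- table(mydf$usernames)
print(freq)
```

The table reports Jerry 3, Mark 5, Phil 5, Tina 3 and Tom 4, which matches the count/4 example for Tom.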

Recommended answer

There are several options. Here are three: one in base R, one with data.table, and one with plyr. All three assume we're starting with a data.frame named "mydf":

Base R

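within(mydf, {
  # ave() repeats each group's length on every row of that group
  temp <- as.numeric(ave(as.character(usernames), usernames, FUN = length))
  ratio <- count/temp
  rm(temp)  # drop the helper so it does not become a column
})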
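As a variation on the base R approach, here is a sketch that applies `ave()` to the numeric `count` column directly, which avoids the character round trip. It is an equivalent formulation under the same assumptions, not the answer's own code; `group_size` is an illustrative name, and the data is rebuilt so the block runs on its own.

```r
# Rebuild the example data from the question.
mydf <- data.frame(
  id = 1:20,
  count = c(140L, 89L, 17L, 114L, 129L, 86L, 21L, 50L, 197L, 160L,
            8L, 14L, 78L, 208L, 155L, 55L, 63L, 20L, 189L, 79L),
  usernames = factor(
    c(4L, 3L, 5L, 5L, 2L, 3L, 1L, 1L, 3L, 1L,
      3L, 2L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 3L),
    levels = 1:5,
    labels = c("Jerry", "Mark", "Phil", "Tina", "Tom")
  )
)

# Number of rows sharing each row's username, aligned with the rows of mydf.
group_size <- ave(mydf$count, mydf$usernames, FUN = length)

# Vectorised division: no explicit loop over usernames.
mydf$ratio <- mydf$count / group_size

# Inspect the rows for Tom, who appears four times.
print(mydf[mydf$usernames == "Tom", ])
```

For the four Tom rows (counts 17, 114, 78 and 208) the ratios come out as 4.25, 28.5, 19.5 and 52. Unlike the `within()` call, this assigns the new column back to `mydf`.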



data.table




library(data.table)
DT <- data.table(mydf)
# .N is the number of rows in the current group; := adds the column by reference
DT[, ratio := count/.N, by = "usernames"]
DT



plyr


library(plyr)
# transform() runs once per usernames group; the result comes back ordered by group
ddply(mydf, .(usernames), transform,
      ratio = count/length(usernames))
