Relative frequency in R by factor


Question

I would like to get a table of the top 10 absolute and relative frequencies of one variable within each level of another factor variable. I have a data frame with 3 columns: the first is a factor variable, the second is the variable I need to count, and the third is a logical variable used as a constraint. (The real database has more than 4 million observations.)

dtf<-data.frame(c("a","a","b","c","b"),c("aaa","bbb","aaa","aaa","bbb"),c(TRUE,FALSE,TRUE,TRUE,TRUE))
colnames(dtf)<-c("factor","var","log")
dtf

  factor var   log
1      a aaa  TRUE
2      a bbb FALSE
3      b aaa  TRUE
4      c aaa  TRUE
5      b bbb  TRUE

So I need to find the top absolute and relative frequencies of "var" where "log"==TRUE within each level of "factor".

I've tried this with absolute frequencies (in the real database I extract the top 10; here I take the top 2):

library(plyr)  # provides ldply
t1<-tapply(dtf$var[dtf$log==TRUE],dtf$factor[dtf$log==TRUE],function(x)(head(sort(table(x),decreasing=TRUE),n=2L)))
# Returns array of lists: list of factors containing list of top frequencies
t2<-lapply(t1, ldply)
# Split list inside by id and freq
t3<-do.call(rbind, lapply(t2, data.frame))
# Returns dataframe of top "var" values and corresponding freq for each group in "factor"
# Factor variable's labels are saved as row.names in t3

The following function finds relative frequencies over the whole database, not grouped by factor:

# Relative frequencies of x over ALL rows where log is TRUE (not per factor)
getrelfreq<-function(x){
  v<-table(x)
  v_rel<-v/nrow(dtf[dtf$log==TRUE,])
  head(sort(v_rel,decreasing=TRUE),n=2L)
}

But I have a problem with relative frequencies: I need to divide each absolute frequency by the number of rows of "var" within each level of "factor", not by the total number of rows where "log"==TRUE. I don't know how to do that inside the tapply call so that the denominator differs for each factor level. I would also like to apply both functions in a single tapply call instead of generating many tables and merging the results, but I have no idea how to combine the two functions.

Recommended answer

If I understand you correctly, you can do something like what I have written below. Use dcast to get the frequency of each var within each factor, then use rowSums() to add them up into the absolute frequency of each var across all factors. You can then use prop.table (with margin 1, i.e. by row) to work out, for each var, the proportion of its occurrences that falls in each factor. Note that I made a slight change to your example data so you can follow what is happening at each stage (I added a second 'bbb' value for factor b where log == TRUE). Try this:

#Data frame (note 2 values for 'bbb' for factor 'b' when log == TRUE)
dtf<-data.frame(c("a","a","b","c","b","b"),c("aaa","bbb","aaa","aaa","bbb","bbb"),c(TRUE,FALSE,TRUE,TRUE,TRUE,TRUE))
colnames(dtf)<-c("factor","var","log")
dtf
#  factor var   log
#1      a aaa  TRUE
#2      a bbb FALSE
#3      b aaa  TRUE
#4      c aaa  TRUE
#5      b bbb  TRUE
#6      b bbb  TRUE


library(reshape2)

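# Find frequency of each var across each factor using dcast
# (summing the logical 'log' column counts the TRUE rows; value.var is named
#  explicitly so dcast does not have to guess it)
mydat <- dcast( dtf[dtf$log==TRUE , ] , var ~ factor , sum , value.var = "log" )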
#  var a b c
#1 aaa 1 1 1
#2 bbb 0 2 0

# Use rowSums to find absolute frequency of each var across all groups
mydat$counts <- rowSums( mydat[,-1] )
# Order by decreasing frequency and just use first 10 rows
head( mydat[ order( mydat$counts , decreasing = TRUE ) , ] , 10 )
#  var a b c counts
#1 aaa 1 1 1      3
#2 bbb 0 2 0      2


# Relative proportions for each var across the factors
data.frame( var = mydat$var , round( prop.table( as.matrix( mydat[,-c(1,ncol(mydat))]) , 1 ) , 2 ) )
#  var    a    b    c
#1 aaa 0.33 0.33 0.33
#2 bbb 0.00 1.00 0.00
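
The proportions above are taken by row, so each var's shares sum to 1 across the factors. If what is wanted is the denominator the question describes (the number of log == TRUE rows within each factor), the following is a minimal base R sketch of one way to get both absolute and relative frequencies in a single pass. The helper name top_freq and the column names abs_freq and rel_freq are hypothetical and not part of the original post; the data is the modified example from this answer.

```r
# Example data as modified in the answer above
dtf <- data.frame(
  factor = c("a", "a", "b", "c", "b", "b"),
  var    = c("aaa", "bbb", "aaa", "aaa", "bbb", "bbb"),
  log    = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the original post): for one group's values x,
# return the n most frequent values with their absolute count and their
# share of that group's own number of rows.
top_freq <- function(x, n = 10L) {
  tab <- table(as.character(x))            # as.character drops unused factor levels
  counts <- setNames(as.vector(tab), names(tab))
  counts <- head(sort(counts, decreasing = TRUE), n)
  data.frame(
    var      = names(counts),
    abs_freq = unname(counts),
    rel_freq = unname(counts) / length(x), # denominator is the group size
    stringsAsFactors = FALSE
  )
}

# Keep only rows where the constraint holds, then split var by factor
kept <- dtf[dtf$log, ]
per_factor <- lapply(split(kept$var, kept$factor, drop = TRUE), top_freq, n = 2L)

# Stack the per-group tables into one data frame with a 'factor' column
result <- do.call(rbind, lapply(names(per_factor), function(f) {
  data.frame(factor = f, per_factor[[f]], stringsAsFactors = FALSE)
}))
rownames(result) <- NULL
result

# Alternative without the top-n cut: column-wise proportions of the
# var-by-factor contingency table (each factor's column sums to 1)
rel_by_factor <- prop.table(table(kept$var, kept$factor), margin = 2)
rel_by_factor
```

With this data, factor b has three qualifying rows, so 'bbb' gets a relative frequency of 2/3 and 'aaa' gets 1/3, while factors a and c each have a single qualifying row. For the real 4-million-row data, n = 10L would replace n = 2L; whether this is fast enough at that size is untested here.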
