Aggregating mixed data by factor column


Problem description

For the past week I have been trying to aggregate a dataset in R. It consists of weight measurements taken in different months, accompanied by a large number of background variables.

I have read many questions on this topic (e.g. "R aggregate data by defining grouping" and "How to aggregate count of unique values of categorical variables in R"), but they all seem either to work with only one type of data or to be interested in only one column. Specifically, the question "Recoding categorical variables to the most common value" deals with almost exactly the same problem, but the proposed answer only handles the categorical data and does not cover the numeric data. My data consist of both factors (categorical and ordinal) and numeric data.

The reproducible example is:

IDnumber <- c("1", "1", "1", "2", "2", "3", "3", "3")
Gender <- c("Male", "Male", "Male", "Female", "Female", "Female", "Female",  "Female")
Weight <- c(80, 82, 82, 70, 66, 54, 50, 52)
LikesSoda <- c("Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", NA)
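# stringsAsFactors = TRUE is needed since R 4.0.0, where strings are no longer converted to factors by default
df <- data.frame(IDnumber, Gender, Weight, LikesSoda, stringsAsFactors = TRUE)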

My output data frame would take the mean of each numeric column and the most frequent level of each factor column. In the example this would look as follows:

IDnumber <- c("1", "2", "3")
Gender <- c("Male", "Female", "Female")
Weight <- c(81.33, 68, 52)
LikesSoda <- c("No", "Yes", "Yes")
output = data.frame(IDnumber, Gender, Weight, LikesSoda)

So far I've tried splitting the data frame into a factor data frame and a numeric data frame and using two aggregate calls with different functions (mean for the numeric data, but I have not been able to find a working function for the categorical data). The other option is dplyr code of the form df %>% group_by(IDnumber) %>% summarise(transformation for each variable), but that requires me to specify manually how to handle each column. Since I have over 2500 columns, this does not seem like a workable solution.

Solution

You could write your own functions and then use lapply. First, write a function to find the most frequent level in a factor variable:

getmode <- function(v) {
  levels(v)[which.max(table(v))]
}
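As a quick check of what getmode returns, here is a minimal sketch on a made-up factor. The vector soda is hypothetical and not part of the question's data. Note that table() drops NA values by default, so missing entries do not count toward the mode.

```r
getmode <- function(v) {
  levels(v)[which.max(table(v))]
}

# Hypothetical example vector, for illustration only
soda <- factor(c("Yes", "No", "No", NA))

table(soda)    # counts per level; the NA entry is excluded by default
getmode(soda)  # "No"
```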

Then write a function that returns either the mean or the mode, depending on the type of variable passed to it:

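my_summary <- function(x, id, ...){
  # Numeric columns: mean within each id group
  if (is.numeric(x)) {
    return(tapply(x, id, mean))
  }
  # Factor columns: most frequent level within each id group
  if (is.factor(x)) {
    return(tapply(x, id, getmode))
  }
  # Any other column type falls through and returns NULL
}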
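Both branches rely on tapply, which applies a function to the values of one vector within each group of another. A minimal sketch using the question's weights; the lower-case names id and weight are chosen for this example only.

```r
# Grouping variable and values taken from the question's sample data
id <- c("1", "1", "1", "2", "2", "3", "3", "3")
weight <- c(80, 82, 82, 70, 66, 54, 50, 52)

# One mean per distinct id, returned as a named 1-d array
means <- tapply(weight, id, mean)
means
```

The result has one element per group, named after the group, which is why data.frame() can later combine the per-column results row by row.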

Finally, use lapply to calculate the summaries:

data.frame(lapply(df, my_summary, id = df$IDnumber))
  IDnumber Gender   Weight LikesSoda
1        1   Male 81.33333        No
2        2 Female 68.00000       Yes
3        3 Female 52.00000       Yes
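Note that my_summary returns NULL for columns that are neither numeric nor factor, such as character columns, which data.frame() creates from strings by default since R 4.0.0. The following is a hedged variant, not part of the original answer: the names getmode2 and my_summary2 are hypothetical, every non-numeric column is treated as categorical, and missing numeric values are ignored. Whether dropping NAs is appropriate depends on the data.

```r
# Hypothetical variants of the answer's functions, for illustration only
getmode2 <- function(v) {
  tab <- table(v)  # NA values are dropped by default
  # No observed values in this group: return NA instead of failing
  if (length(tab) == 0 || max(tab) == 0) return(NA_character_)
  names(tab)[which.max(tab)]
}

my_summary2 <- function(x, id) {
  # Numeric columns: group mean, ignoring missing values
  if (is.numeric(x)) return(tapply(x, id, mean, na.rm = TRUE))
  # Everything else (factor, character, ...) is treated as categorical
  tapply(as.character(x), id, getmode2)
}

df <- data.frame(
  IDnumber  = c("1", "1", "1", "2", "2", "3", "3", "3"),
  Gender    = c("Male", "Male", "Male", "Female", "Female",
                "Female", "Female", "Female"),
  Weight    = c(80, 82, 82, 70, 66, 54, 50, 52),
  LikesSoda = c("Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", NA)
)

out <- data.frame(lapply(df, my_summary2, id = df$IDnumber))
out
```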

If two or more levels of a factor share the same maximum frequency, which.max will just return the first one. I understand from your comment that you just want to know how many such ties there are, so one option might be to amend the getmode function slightly, so that it adds an asterisk to the level when there is a tie:

getmode <- function(v) {
  tab <- table(v)
  if (sum(tab %in% max(tab)) > 1)  return(paste(levels(v)[which.max(tab)], '*'))
  levels(v)[which.max(tab)]
}

(Changing your sample data so there is one Female and one Male with IDnumber == "2")

data.frame(lapply(df, my_summary, id = df$IDnumber))

  IDnumber   Gender   Weight LikesSoda
1        1     Male 81.33333        No
2        2 Female * 68.00000       Yes
3        3   Female 52.00000       Yes

I'm afraid that's a bit of a messy 'solution', but if you just want to get an idea of how common that issue is, perhaps it will be sufficient for your needs.
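If the aim is only to count how often ties occur, a separate check may be tidier than marking the level with an asterisk. A minimal sketch, assuming a hypothetical helper has_tie and the modified sample data in which IDnumber "2" has one Female and one Male:

```r
# Hypothetical helper: TRUE when the most frequent level is not unique
has_tie <- function(v) {
  tab <- table(v)
  length(tab) > 0 && max(tab) > 0 && sum(tab == max(tab)) > 1
}

# Modified sample data: group "2" has one Female and one Male
id <- c("1", "1", "1", "2", "2", "3", "3", "3")
gender <- factor(c("Male", "Male", "Male", "Female", "Male",
                   "Female", "Female", "Female"))

ties <- tapply(gender, id, has_tie)
ties        # one logical value per group
sum(ties)   # number of groups with a tie
```

Applying the same check across all factor columns with lapply would give a per-column count of tied groups.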
