Intradataframe Analysis: creating a derivative data frame from another data frame


Problem description

The title may be a little obtuse, since I'm still getting up to speed with R. I'm doing some data frame manipulation to extract percentages for classification groups: one column is a factor that defines the groups, and another column holds the values I want percentages of. I'll use the built-in mtcars dataset to demonstrate what I'm trying to achieve, where gear plays the role of the classification variable and cyl is the data I'm trying to get percentages from.

Just some background details to smooth the question:

The gear column spans 3 distinct values: 3, 4, 5. The cyl column spans 3 distinct values as well: 4, 6, 8.

The first element of my list gives, for each gear type, the fraction of cars with at most 4 cylinders. Among the 15 3-gear models there is only one, the Toyota Corona, so the fraction should be 1/15 = 0.0667. Among the 12 4-gear models there are eight, which yields 8/12 = 0.667.
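The arithmetic above can be checked directly against mtcars. This is only a minimal sketch, assuming the built-in dataset is unmodified; the name `at_most_4` is introduced here for illustration and is not part of the original question.

```r
# Fraction of cars with at most 4 cylinders, within each gear group.
# mean() of a logical vector gives the proportion of TRUE values.
at_most_4 <- tapply(mtcars$cyl <= 4, mtcars$gear, mean)
print(round(at_most_4, 4))

# Counts behind those fractions: rows are gear values, columns are cyl values.
print(table(gear = mtcars$gear, cyl = mtcars$cyl))
```

The count table makes it easy to read off the numerators and denominators (for example, 1 of 15 for gear 3 and 8 of 12 for gear 4).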

Here is the method I wrote to do this computation. However, the structure of the output is not what I want. Instead, I'd like to merge it all into a data frame whose first column holds the distinct cyl values and whose other columns are the gear types 3, 4, and 5, with one row of percentages per cyl value. I'm very close, but I need some help reshaping the list I currently get, or perhaps an alternative apply function that produces the table of percentages I'm after, or any other magic someone can cook up.

> lapply( unique( sort( mtcars$cyl ) ) , function(c) { tapply( mtcars$cyl , mtcars$gear , function(x) sum( x <= c ) / length(x) ) } )
[[1]]
         3          4          5 
0.06666667 0.66666667 0.40000000 

[[2]]
  3   4   5 
0.2 1.0 0.6 

[[3]]
3 4 5 
1 1 1 

This is what I'd expect the desired data frame to look like:

  cyl         X3        X4  X5
1   4 0.06666667 0.6666667 0.4
2   6 0.20000000 1.0000000 0.6
3   8 1.00000000 1.0000000 1.0


Recommended answer

I came up with a solution after googling "convert list of arrays into data.frame", which immediately led me to the following SO post.

p <- lapply( unique( sort( mtcars$cyl ) ) , function(c) { tapply( mtcars$cyl , mtcars$gear , function(x) sum( x <= c ) / length(x) ) } )

> df <- data.frame( matrix( unlist(p) , nrow = length(p) , byrow = TRUE ) )
> df
          X1        X2  X3
1 0.06666667 0.6666667 0.4
2 0.20000000 1.0000000 0.6
3 1.00000000 1.0000000 1.0

The solution works, except that the classification names are dropped from the column headers. A follow-up assignment recovers them:

> colnames(df) <- names(p[[1]])
> rownames(df) <- unique( sort( mtcars$cyl ) )
> df
           3         4   5
4 0.06666667 0.6666667 0.4
6 0.20000000 1.0000000 0.6
8 1.00000000 1.0000000 1.0
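As an aside, the same numbers can probably be reached without the intermediate list, by taking cumulative sums over a table of row-wise proportions. This is an alternative sketch, not the approach used above; the names `prop` and `cum` are introduced here for illustration.

```r
# Proportion of each cyl value within each gear group (each row sums to 1).
prop <- prop.table(table(gear = mtcars$gear, cyl = mtcars$cyl), margin = 1)

# Cumulative sum across cyl within each gear gives the fraction of cars
# with cyl <= threshold. apply() returns one column per gear, so the
# result has cyl thresholds as rows and gear values as columns.
cum <- apply(prop, 1, cumsum)
print(cum)
```

Because both the table and `cumsum()` carry names along, the row and column labels should survive without any follow-up assignment.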

Actually, other answers to the linked question address the column headers issue nicely. The row header problem remains, though, since those values get lost in my anonymous function calls.
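One way to keep both sets of labels is to bind the list elements row-wise and carry the cyl thresholds along as an explicit column. This is a minimal sketch under the assumption that `do.call(rbind, ...)` is an acceptable stand-in for the `unlist()` plus `matrix()` step; the names `cyl_levels`, `threshold`, and `m` are introduced here for illustration.

```r
cyl_levels <- sort(unique(mtcars$cyl))

# One named vector (names are the gear values) per cylinder threshold.
p <- lapply(cyl_levels, function(threshold) {
  tapply(mtcars$cyl, mtcars$gear, function(x) mean(x <= threshold))
})

# rbind keeps the gear names as column names, unlike unlist() + matrix().
m <- do.call(rbind, p)

# Carry the thresholds along as an explicit cyl column.
# data.frame() turns the column names "3", "4", "5" into X3, X4, X5.
df <- data.frame(cyl = cyl_levels, m)
print(df)
```

Storing the thresholds in their own column, rather than as row names, matches the layout requested in the question.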
