Split a large data frame into a list of data frames based on a common value in a column

Problem description

I have a data frame with 10 columns that records the actions of "users". One of the columns (column 10) contains a user ID, which identifies the user and is therefore not unique across rows. The data frame has about 750,000 rows. I am trying to split it by the column containing the user identifier into individual data frames (that is, to get a list or vector of data frames), so that I can isolate the actions of a single actor.

ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
4  | aad   | bb4   | ... | u_002

resulting in

list(
ID | Data1 | Data2 | ... | UserID
1  | aaa   | bbb   | ... | u_001
2  | aab   | bb2   | ... | u_001
3  | aac   | bb3   | ... | u_001
,
4  | aad   | bb4   | ... | u_002
...)

The following works very well for me on a small sample (1000 rows):

paths = by(smallsampleMat, smallsampleMat[,"userID"], function(x) x)

I then access the element I want with paths[1], for instance.

When I apply this to the original large data frame, or even to a matrix representation of it, it chokes my machine (4 GB RAM, Mac OS X 10.6, R 2.15) and never completes. (I know that a newer R version exists, but I believe this is not the main problem.)

It seems that split performs better and does complete after a long time, but with my limited R knowledge I do not know how to piece the resulting list of vectors into a vector of matrices.

path = split(smallsampleMat, smallsampleMat[,10]) 

I have also considered using big.matrix etc., but without much success in speeding up the process.

Recommended answer

You can access each element of the list just as easily using, e.g., path[[1]]. You can't put a set of matrices into an atomic vector and access each element: a matrix is itself an atomic vector with a dimension attribute. I would use the list structure returned by split; it's what it was designed for. Each list element can hold data of a different type and size, so it's very versatile, and you can use the *apply functions to operate further on each element in the list. Example below.

#  For reproducible data
set.seed(1)

#  Make some data
userid <- rep(1:2,times=4)
data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
data2 <- sample(10,8)
df <- data.frame( userid , data1 , data2 )

#  Split on userid
out <- split( df , f = df$userid )
#$`1`
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5

#$`2`
#  userid data1 data2
#2      2   xfv     4
#4      2   bfe    10
#6      2   mrx     2
#8      2   fqd     9

Access each element using the [[ operator like this:

out[[1]]
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5
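
The question used paths[1], so it may help to see how [ and [[ differ on the list that split returns. The following is a minimal sketch on a made-up three-row data frame (the column names and values are assumptions for illustration, not the asker's data): single brackets return a one-element list, while double brackets, by position or by name, return the data frame itself.

```r
# Hypothetical toy data, not the asker's real data
df <- data.frame(userid = c("u_001", "u_001", "u_002"),
                 data2  = c(3, 1, 4),
                 stringsAsFactors = FALSE)

out <- split(df, f = df$userid)

names(out)        # the list is named after the values of userid

class(out[1])     # "list": single brackets keep the list wrapper
class(out[[1]])   # "data.frame": double brackets extract the element

# Elements can also be extracted by name instead of by position
nrow(out[["u_001"]])   # 2
```

Extracting by name is usually more robust than extracting by position, because the order of the list follows the sorted factor levels of the splitting column rather than the order in which users first appear.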

Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:

sapply( out , function(x) mean( x$data2 ) )
#   1    2 
#3.75 6.25 
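
The question also mentions that split returned a list of vectors when applied to a matrix. That happens because split treats a plain matrix as an atomic vector and ignores its dimensions. If a list of matrices is really wanted, one possible approach (a sketch under assumed names, on a made-up matrix) is to split the row indices and then subset the matrix by rows:

```r
# Hypothetical toy matrix standing in for the asker's data (names are assumptions)
m <- cbind(id     = 1:4,
           value  = c(10, 20, 30, 40),
           userid = c(1, 1, 1, 2))

# split() on a matrix ignores the dimensions and returns plain vectors
vecs <- split(m, m[, "userid"])
is.matrix(vecs[[1]])   # FALSE

# Split the row indices instead, then subset the matrix by rows.
# drop = FALSE keeps single-row groups as matrices.
rows <- split(seq_len(nrow(m)), m[, "userid"])
mats <- lapply(rows, function(i) m[i, , drop = FALSE])

dim(mats[["1"]])       # 3 3
dim(mats[["2"]])       # 1 3
```

This is essentially what split does for a data frame, so converting the matrix with as.data.frame and calling split on the result is an alternative when the columns have mixed types.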
