Split a large dataframe into a list of data frames based on common value in column
I have a data frame with 10 columns that records the actions of "users". One of the columns (column 10) contains a non-unique ID identifying the user. The data frame has about 750,000 rows. I am trying to extract individual data frames (so getting a list or vector of data frames) split by the column containing the "user" identifier, to isolate the actions of a single actor.
ID | Data1 | Data2 | ... | UserID
1 | aaa | bbb | ... | u_001
2 | aab | bb2 | ... | u_001
3 | aac | bb3 | ... | u_001
4 | aad | bb4 | ... | u_002
resulting in
list(
ID | Data1 | Data2 | ... | UserID
1 | aaa | bbb | ... | u_001
2 | aab | bb2 | ... | u_001
3 | aac | bb3 | ... | u_001
,
4 | aad | bb4 | ... | u_002
...)
The following works very well for me on a small sample (1000 rows):
paths = by(smallsampleMat, smallsampleMat[,"userID"], function(x) x)
and then I access the element I want with, for instance, paths[1].
When I apply this to the original large data frame, or even to a matrix representation of it, it chokes my machine (4GB RAM, MacOSX 10.6, R 2.15) and never completes (I know that a newer R version exists, but I believe this is not the main problem).
It seems that split performs better and does complete after a long time, but I do not know (my R knowledge is limited) how to piece the resulting list of vectors into a vector of matrices.
path = split(smallsampleMat, smallsampleMat[,10])
I have also considered using big.matrix etc., but without much success in speeding up the process.
You can just as easily access each element in the list using e.g. path[[1]]. You can't put a set of matrices into an atomic vector and access each element. A matrix is an atomic vector with dimension attributes. I would use the list structure returned by split; it's what it was designed for. Each list element can hold data of different types and sizes, so it's very versatile, and you can use *apply functions to further operate on each element in the list. Example below.
# For reproducible data
set.seed(1)
# Make some data
userid <- rep(1:2,times=4)
data1 <- replicate(8 , paste( sample(letters , 3 ) , collapse = "" ) )
data2 <- sample(10,8)
df <- data.frame( userid , data1 , data2 )
# Split on userid
out <- split( df , f = df$userid )
#$`1`
# userid data1 data2
#1 1 gjn 3
#3 1 yqp 1
#5 1 rjs 6
#7 1 jtw 5
#$`2`
# userid data1 data2
#2 2 xfv 4
#4 2 bfe 10
#6 2 mrx 2
#8 2 fqd 9
Access each element using the [[ operator like this:
out[[1]]
# userid data1 data2
#1 1 gjn 3
#3 1 yqp 1
#5 1 rjs 6
#7 1 jtw 5
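Since the question used paths[1], it may help to see how [ and [[ differ on a list. The sketch below rebuilds a small stand-in for the example data (the data1 column is left out and the data2 values are typed in by hand, so nothing depends on the random seed); the variable names one_as_list, one_as_df and by_name are only illustrative.

```r
# Hand-typed stand-in for the example data (no random sampling involved)
df <- data.frame(userid = rep(1:2, times = 4),
                 data2  = c(3, 4, 1, 10, 6, 2, 5, 9))
out <- split(df, f = df$userid)

one_as_list <- out[1]      # `[` keeps the list wrapper: a list of length 1
one_as_df   <- out[[1]]    # `[[` extracts the data frame itself
by_name     <- out[["2"]]  # elements are named after the userid values

class(one_as_list)  # "list"
class(one_as_df)    # "data.frame"
nrow(by_name)       # 4
```

With real IDs such as u_001, access by name (for example out[["u_001"]]) is usually easier to read than access by position.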
Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:
sapply( out , function(x) mean( x$data2 ) )
# 1 2
#3.75 6.25
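If memory is the bottleneck on the full data, one possible variation (not part of the answer above, and not tested on 750,000 rows) is to split the row numbers instead of the data frame, and to subset only when a given user's rows are needed. This is a sketch under that assumption; idx, user2 and means are illustrative names, and the data is the same hand-typed stand-in for the example.

```r
# Hand-typed stand-in for the example data
df <- data.frame(userid = rep(1:2, times = 4),
                 data2  = c(3, 4, 1, 10, 6, 2, 5, 9))

# Split row numbers rather than the data frame itself
idx <- split(seq_len(nrow(df)), df$userid)

# Pull out one user's rows only when they are needed
user2 <- df[idx[["2"]], ]

# Per-user summary without building every sub-data-frame
means <- vapply(idx, function(i) mean(df$data2[i]), numeric(1))
means
#    1    2
# 3.75 6.25
```

The index list holds only integers, so it stays small even when the data frame is wide; whether this is actually faster than split on the whole data frame would need to be measured on the real data.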