R: combine same identifiers in dataframe
I have a data frame with two columns: an identifier column (ID) and a column of names. Each identifier appears several times in the ID column (see below).
ID Names
uc001aag.1 DKFZp686C24272
uc001aag.1 DQ786314
uc001aag.1 uc001aag.1
uc001aah.2 AK056232
uc001aah.2 FLJ00038
uc001aah.2 uc001aah.1
uc001aah.2 uc001aah.2
uc001aai.1 AY217347
Now I want to create a dataframe like this:
ID Names
uc001aag.1 DKFZp686C24272 | DQ786314 | uc001aag.1
uc001aah.2 AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
uc001aai.1 AY217347
Can anyone help me?
aggregate is quite fast, but you can use an sapply solution to parallelize the code. This is easy to do on Windows with the snowfall package:
require(snowfall)
sfInit(parallel=TRUE,cpus=2)
sfExport("Data")
ID <- unique(Data$ID)
CombNames <- sfSapply(ID,function(i){
paste(Data$Names[Data$ID==i],collapse=" | ")
})
data.frame(ID,CombNames)
sfStop()
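Before starting a cluster, it can help to run the same per-ID function serially and check the result. The following is only a minimal sketch: it assumes the data frame from the question is rebuilt by hand as `Data` with two character columns, and `combine_names` is a hypothetical helper name, not part of snowfall.

```r
# Example data from the question, rebuilt by hand
# (assumption: both columns are plain character vectors)
Data <- data.frame(
  ID = c("uc001aag.1", "uc001aag.1", "uc001aag.1",
         "uc001aah.2", "uc001aah.2", "uc001aah.2", "uc001aah.2",
         "uc001aai.1"),
  Names = c("DKFZp686C24272", "DQ786314", "uc001aag.1",
            "AK056232", "FLJ00038", "uc001aah.1", "uc001aah.2",
            "AY217347"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the same function body that sfSapply runs for each ID
combine_names <- function(i) {
  paste(Data$Names[Data$ID == i], collapse = " | ")
}

ID <- unique(Data$ID)

# vapply is used here as a stricter serial stand-in for sfSapply:
# it insists on exactly one string per ID
CombNames <- vapply(ID, combine_names, character(1), USE.NAMES = FALSE)

Result <- data.frame(ID, CombNames, stringsAsFactors = FALSE)
print(Result)
```

Once the serial result looks right, the same function can be handed to sfSapply unchanged, provided `Data` has been exported to the workers with sfExport.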
The parallel version gives you an extra speedup, but the plain (non-parallel) sapply solution is actually slower than aggregate. tapply is a bit faster, but it can't be parallelized with snowfall. On my computer:
n <- 3000
m <- 3
Data <- data.frame( ID = rep(1:n,m),
Names=rep(LETTERS[1:m],each=n))
# using snowfall for parallel sapply
system.time({
ID <- unique(Data$ID)
CombNames <- sfSapply(ID,function(i){
paste(Data$Names[Data$ID==i],collapse=" | ")
})
data.frame(ID,CombNames)
})
user system elapsed
0.02 0.00 0.33
# using tapply
system.time({
CombNames <- tapply(Data$Names,Data$ID,paste,collapse=" | ")
data.frame(ID=names(CombNames),CombNames)
})
user system elapsed
0.44 0.00 0.44
# using aggregate
system.time(
aggregate(Names ~ ID, data=Data, FUN=paste, collapse=" | ")
)
user system elapsed
0.47 0.00 0.47
# using the normal sapply
system.time({
ID <- unique(Data$ID)
CombNames <- sapply(ID,function(i){
paste(Data$Names[Data$ID==i],collapse=" | ")
})
data.frame(ID,CombNames)
})
user system elapsed
0.75 0.00 0.75
Note:
For the record, a better sapply solution would be:
CombNames <- sapply(split(Data$Names,Data$ID),paste,collapse=" | ")
data.frame(ID=names(CombNames),CombNames)
which is equivalent to tapply. But parallelizing this one is actually slower, as you have to move more data around within the sfSapply call. The speed of the parallel version comes from copying the dataset to every CPU. Keep this in mind when your dataset is huge: you pay for the speed with more memory usage.
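To see that the non-parallel approaches agree, they can be run side by side on the question's example data. This is a sketch using base R only; it assumes the example data is rebuilt by hand with character columns, and the variable names `by_aggregate`, `by_tapply`, `by_split` and `same` are illustrative.

```r
# Example data from the question, rebuilt by hand
# (assumption: both columns are plain character vectors)
Data <- data.frame(
  ID = c("uc001aag.1", "uc001aag.1", "uc001aag.1",
         "uc001aah.2", "uc001aah.2", "uc001aah.2", "uc001aah.2",
         "uc001aai.1"),
  Names = c("DKFZp686C24272", "DQ786314", "uc001aag.1",
            "AK056232", "FLJ00038", "uc001aah.1", "uc001aah.2",
            "AY217347"),
  stringsAsFactors = FALSE
)

# aggregate: formula interface; collapse is passed through to paste
by_aggregate <- aggregate(Names ~ ID, data = Data, FUN = paste, collapse = " | ")

# tapply: returns a named 1-d array with one string per ID
by_tapply <- tapply(Data$Names, Data$ID, paste, collapse = " | ")

# split + sapply: the variant from the note above
by_split <- sapply(split(Data$Names, Data$ID), paste, collapse = " | ")

# Strip names and dimensions, then compare the combined strings
same <- identical(by_aggregate$Names, as.vector(by_tapply)) &&
  identical(as.vector(by_tapply), unname(by_split))

print(by_aggregate)
print(same)
```

aggregate returns a data frame directly, while tapply and split + sapply return named vectors that still need the data.frame(ID=names(CombNames),CombNames) step shown above.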