R: combine same identifiers in dataframe


Question

I have a dataframe with two columns: an identifier and a column of names. Each identifier appears several times in the ID column (see below).

 ID           Names
uc001aag.1  DKFZp686C24272
uc001aag.1  DQ786314
uc001aag.1  uc001aag.1
uc001aah.2  AK056232
uc001aah.2  FLJ00038
uc001aah.2  uc001aah.1
uc001aah.2  uc001aah.2
uc001aai.1  AY217347

Now I want to create a dataframe like this:

 ID           Names
uc001aag.1  DKFZp686C24272 | DQ786314 | uc001aag.1
uc001aah.2  AK056232 | FLJ00038 | uc001aah.1 | uc001aah.2
uc001aai.1  AY217347

Can anyone help me?

Solution

aggregate is quite fast, but you can use an sapply solution to parallelize the code. This is easy to do on Windows using snowfall:

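library(snowfall)
sfInit(parallel=TRUE,cpus=2)   # start two worker processes
sfExport("Data")               # copy Data to every worker

ID <- unique(Data$ID)
# for each ID, paste all of its names into a single string
CombNames <- sfSapply(ID,function(i){
    paste(Data$Names[Data$ID==i],collapse=" | ")
})
data.frame(ID,CombNames)
sfStop()                       # shut the workers down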
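
For comparison, the single-process aggregate approach mentioned above can be sketched on the example data from the question. This is only a minimal sketch: the data frame is retyped from the question, and stringsAsFactors = FALSE is an assumption made here so that both columns stay plain character vectors.

```r
# Example data retyped from the question (assumption: character columns)
Data <- data.frame(
  ID    = c("uc001aag.1", "uc001aag.1", "uc001aag.1",
            "uc001aah.2", "uc001aah.2", "uc001aah.2", "uc001aah.2",
            "uc001aai.1"),
  Names = c("DKFZp686C24272", "DQ786314", "uc001aag.1",
            "AK056232", "FLJ00038", "uc001aah.1", "uc001aah.2",
            "AY217347"),
  stringsAsFactors = FALSE
)

# One row per ID; all names of that ID are pasted together with " | "
res <- aggregate(Names ~ ID, data = Data, FUN = paste, collapse = " | ")
print(res)
```

The collapse argument is not an argument of aggregate itself; it is passed through to paste, which is what turns each group of names into one string.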

The parallel version gives you an extra speedup, but the single-process sapply solution is actually slower than aggregate. tapply is a bit faster than aggregate, but it can't be parallelized using snowfall. On my computer:

n <- 3000
m <- 3
Data <- data.frame( ID = rep(1:n,m),
                    Names=rep(LETTERS[1:m],each=n))
 # using snowfall for parallel sapply    
 system.time({
   ID <- unique(Data$ID)
   CombNames <- sfSapply(ID,function(i){
     paste(Data$Names[Data$ID==i],collapse=" | ")
   })
   data.frame(ID,CombNames)
 }) 
   user  system elapsed 
   0.02    0.00    0.33 

 # using tapply
 system.time({
   CombNames <- tapply(Data$Names,Data$ID,paste,collapse=" | ")
   data.frame(ID=names(CombNames),CombNames)
 })
   user  system elapsed 
   0.44    0.00    0.44 

 # using aggregate
 system.time(
   aggregate(Names ~ ID, data=Data, FUN=paste, collapse=" | ")
 )
   user  system elapsed 
   0.47    0.00    0.47 

 # using the normal sapply
 system.time({
   ID <- unique(Data$ID)
   CombNames <- sapply(ID,function(i){
     paste(Data$Names[Data$ID==i],collapse=" | ")
   })
   data.frame(ID,CombNames)
 })
   user  system elapsed 
   0.75    0.00    0.75 


Note:

For the record, a better sapply solution would be:

CombNames <- sapply(split(Data$Names,Data$ID),paste,collapse=" | ")
data.frame(ID=names(CombNames),CombNames)

which is equivalent to tapply. But parallelizing this one is actually slower, because more data has to be moved around within sfSapply. The speed of the parallel version comes from copying the dataset to every CPU. Keep this in mind when your dataset is huge: you pay for the speed with more memory usage.
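
The claimed equivalence between the split/sapply form and tapply can be checked on a small example. The sketch below uses data retyped from the question; the variable names via_split and via_tapply are introduced here for illustration only.

```r
# Example data retyped from the question (assumption: character columns)
Data <- data.frame(
  ID    = c("uc001aag.1", "uc001aag.1", "uc001aag.1",
            "uc001aah.2", "uc001aah.2", "uc001aah.2", "uc001aah.2",
            "uc001aai.1"),
  Names = c("DKFZp686C24272", "DQ786314", "uc001aag.1",
            "AK056232", "FLJ00038", "uc001aah.1", "uc001aah.2",
            "AY217347"),
  stringsAsFactors = FALSE
)

# split() groups the names by ID, sapply() pastes each group into one string
via_split  <- sapply(split(Data$Names, Data$ID), paste, collapse = " | ")
# tapply() does the grouping and the pasting in a single call
via_tapply <- tapply(Data$Names, Data$ID, paste, collapse = " | ")

# Same group labels and same combined strings
# (tapply returns a 1-d array, so the values are compared as plain vectors)
same_names  <- identical(names(via_split), names(via_tapply))
same_values <- identical(unname(via_split), as.vector(via_tapply))
print(c(same_names, same_values))

data.frame(ID = names(via_split), CombNames = via_split, row.names = NULL)
```

Both forms group by the sorted unique IDs, so the rows come out in the same order either way.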
