在data.table中的2个不同列中折叠行? [英] collapse rows in 2 different columns in data.table?

查看:44
本文介绍了在data.table中的2个不同列中折叠行?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对于虚拟数据集

require(data.table)
require(reshape2)
teamid <- c(1,2,3)
member <- c("a,b","","c,g,h")
leader <- c("c", "d,e", "")
dt <- data.table(teamid, member, leader)

现在数据集看起来像这样:

Now the dataset looks like this:


   teamid member leader
1:      1    a,b      c
2:      2           d,e
3:      3  c,g,h  

3列.对于每个团队,他们在不同的列中都有团队成员和团队负责人.团队可能只有成员而没有领导者,反之亦然.

3 Columns. For each team, they have team members, and team leaders in different column. Teams may have only members without leaders, and vice versa.

以下是我的 ALMOST 所需的输出:

The following is my ALMOST desired output:


   teamid value leader
1:      1     a  FALSE
2:      1     b  FALSE
3:      1     c   TRUE
4:      1     c   TRUE
5:      2     d   TRUE
6:      2     e   TRUE
7:      3     c  FALSE
8:      3     g  FALSE
9:      3     h  FALSE

我希望将两列合并为一个,如果一个是团队负责人,则添加一个标签.

I want to have the two columns merged into one, and add a tag if one is a team leader.

对此我有一个丑陋的解决方案,

I have an ugly solution for this,

<代码>dt1<-dt [,strsplit(member,,"),by = teamid]dt2<-dt [,strsplit(leader,,"),by = teamid]setkey(dt1,teamid)setkey(dt2,teamid)dt3<-merge(dt1,dt2,all = TRUE)dt4<-熔体(dt3,id = 1,尺寸= c("V1.x","V1.y"))dt5<-dt4 [value!="NA_real"]dt6<-dt5 [,领导:=(变量=="V1.y")] [,变量:= NULL]setkey(dt6,teamid)setnames(dt6,value,member)

问题:

  1. 我认为这种解决方案效率不高,请先合并再融化.那么关于实现此目的的其他方式的任何想法吗?

  1. This solution is not efficency I think, first merge and then melt. So any ideas about other ways to do this?

第3行和第4行有重复的行.

There're duplicated rows, in row 3 and row 4.

当我尝试更改列名时出现错误

When I tried to change column name, an error came up

setnames(dt6,value,member)

setnames(dt6,value,member)

setnames(dt6,value,member)中的错误:找不到对象'value'

Error in setnames(dt6, value, member) : object 'value' not found

也许最重要的是,

当我尝试对具有100万行,3列的真实数据集进行测试时,发生了以下错误

When I tried to test on my real dataset, which have more 1million rows, 3 columns the following error occured

merge(df1,df2,all = TRUE)vecseq(f__,len__,if(allow.cartesian)NULL否则为as.integer(max(nrow(x),:连接结果为238797行;大于142095 = max(nrow(x),nrow(i)).检查i中是否有重复的键值,每个键值都一遍又一遍地连接到x中的同一组.如果可以,请尝试包含 j 并删除 by (逐个而不逐个),以便j为每个组运行以避免大分配.如果确定要继续,请使用allow.cartesian = TRUE重新运行.否则,请在FAQ,Wiki,堆栈溢出和数据表帮助中搜索此错误消息,以获取建议.

merge(df1,df2, all = TRUE) Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in 238797 rows; more than 142095 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

有什么建议吗?非常感谢!

Any suggestion? Thanks a lot!

推荐答案

首先融合.

result <- melt(dt,id="teamid", variable.name="status", value.name="member")
result <- result[nchar(member)>0,strsplit(member,","),by=list(teamid,status)]
setnames(result,"V1","member")
setkey(result,teamid,status)
result
#    teamid status member
# 1:      1 member      a
# 2:      1 member      b
# 3:      1 leader      c
# 4:      2 leader      d
# 5:      2 leader      e
# 6:      3 member      c
# 7:      3 member      g
# 8:      3 member      h

如果您想摆脱 status 列并在成员列中添加标记",则可以采用以下方式:

If you want to get rid of the status column and add a "tag" to the member column, you can do it this way:

result[status=="leader",member:=paste0(member,"*")]
result[,status:=NULL]
result
#    teamid member
# 1:      1      a
# 2:      1      b
# 3:      1     c*
# 4:      2     d*
# 5:      2     e*
# 6:      3      c
# 7:      3      g
# 8:      3      h

这篇关于在data.table中的2个不同列中折叠行?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆