在data.table中的2个不同列中折叠行? [英] collapse rows in 2 different columns in data.table?
问题描述
对于虚拟数据集
require(data.table)
require(reshape2)
teamid <- c(1,2,3)
member <- c("a,b","","c,g,h")
leader <- c("c", "d,e", "")
dt <- data.table(teamid, member, leader)
现在数据集看起来像这样:
Now the dataset looks like this:
teamid member leader
1: 1 a,b c
2: 2 d,e
3: 3 c,g,h
3列.对于每个团队,他们在不同的列中都有团队成员和团队负责人.团队可能只有成员而没有领导者,反之亦然.
3 Columns. For each team, they have team members, and team leaders in different column. Teams may have only members without leaders, and vice versa.
以下是我的 ALMOST 所需的输出:
The following is my ALMOST desired output:
teamid value leader
1: 1 a FALSE
2: 1 b FALSE
3: 1 c TRUE
4: 1 c TRUE
5: 2 d TRUE
6: 2 e TRUE
7: 3 c FALSE
8: 3 g FALSE
9: 3 h FALSE
我希望将两列合并为一个,如果一个是团队负责人,则添加一个标签.
I want to have the two columns merged into one, and add a tag if one is a team leader.
对此我有一个丑陋的解决方案,
I have an ugly solution for this,
<代码>dt1<-dt [,strsplit(member,,"),by = teamid]dt2<-dt [,strsplit(leader,,"),by = teamid]setkey(dt1,teamid)setkey(dt2,teamid)dt3<-merge(dt1,dt2,all = TRUE)dt4<-熔体(dt3,id = 1,尺寸= c("V1.x","V1.y"))dt5<-dt4 [value!="NA_real"]dt6<-dt5 [,领导:=(变量=="V1.y")] [,变量:= NULL]setkey(dt6,teamid)setnames(dt6,value,member)
问题:
-
我认为这种解决方案效率不高,请先合并再融化.那么关于实现此目的的其他方式的任何想法吗?
This solution is not efficency I think, first merge and then melt. So any ideas about other ways to do this?
第3行和第4行有重复的行.
There're duplicated rows, in row 3 and row 4.
当我尝试更改列名时出现错误
When I tried to change column name, an error came up
setnames(dt6,value,member)
setnames(dt6,value,member)
setnames(dt6,value,member)中的错误:找不到对象'value'
Error in setnames(dt6, value, member) : object 'value' not found
也许最重要的是,
当我尝试对具有100万行,3列的真实数据集进行测试时,发生了以下错误
When I tried to test on my real dataset, which have more 1million rows, 3 columns the following error occured
merge(df1,df2,all = TRUE)vecseq(f__,len__,if(allow.cartesian)NULL否则为as.integer(max(nrow(x),:连接结果为238797行;大于142095 = max(nrow(x),nrow(i)).检查i中是否有重复的键值,每个键值都一遍又一遍地连接到x中的同一组.如果可以,请尝试包含
j
并删除by
(逐个而不逐个),以便j为每个组运行以避免大分配.如果确定要继续,请使用allow.cartesian = TRUE重新运行.否则,请在FAQ,Wiki,堆栈溢出和数据表帮助中搜索此错误消息,以获取建议.
merge(df1,df2, all = TRUE) Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in 238797 rows; more than 142095 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including
j
and droppingby
(by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
有什么建议吗?非常感谢!
Any suggestion? Thanks a lot!
推荐答案
首先融合.
result <- melt(dt,id="teamid", variable.name="status", value.name="member")
result <- result[nchar(member)>0,strsplit(member,","),by=list(teamid,status)]
setnames(result,"V1","member")
setkey(result,teamid,status)
result
# teamid status member
# 1: 1 member a
# 2: 1 member b
# 3: 1 leader c
# 4: 2 leader d
# 5: 2 leader e
# 6: 3 member c
# 7: 3 member g
# 8: 3 member h
如果您想摆脱 status
列并在成员列中添加标记",则可以采用以下方式:
If you want to get rid of the status
column and add a "tag" to the member column, you can do it this way:
result[status=="leader",member:=paste0(member,"*")]
result[,status:=NULL]
result
# teamid member
# 1: 1 a
# 2: 1 b
# 3: 1 c*
# 4: 2 d*
# 5: 2 e*
# 6: 3 c
# 7: 3 g
# 8: 3 h
这篇关于在data.table中的2个不同列中折叠行?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!