Add rows in a data.table but not when certain columns take the same values
Problem description
I have a data.table dat with 4 columns, say (col1, col2, col3, col4).

Input data:

    structure(list(col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 4.6, 5),
        col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 3.6),
        col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 1.4),
        col4 = structure(c(1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L),
            .Label = c("setosa", "versicolor", "virginica", "eer"), class = "factor")),
        .Names = c("col1", "col2", "col3", "col4"), row.names = c(NA, -10L),
        class = c("data.table", "data.frame"))

        col1 col2 col3   col4
     1:  5.1  3.5  1.4 setosa
     2:  5.1  3.5  1.4 setosa
     3:  4.7  3.2  1.3 setosa
     4:  4.6  3.1  1.5 setosa
     5:  5.0  3.6  1.4 setosa
     6:  5.1  3.5  3.4    eer
     7:  5.1  3.5  3.4    eer
     8:  4.7  3.2  1.3    eer
     9:  4.6  3.1  1.5    eer
    10:  5.0  3.6  1.4    eer

I am performing the following operation on col3 for each unique value of col4:

    dat[ , r_new := sum(col3, na.rm = TRUE), .(col4)] #syntax 1

So, the above syntax creates a new column r_new whose values are obtained by summing the values of col3 over the rows where col4 is the same. Each unique value of col4 therefore gets a single value in column r_new.

What I want to do now is the same as above, but without including rows where col1 and col2 repeat values already seen (something like the pseudocode below):

    dat[col1 is different OR col2 is different , r_new := sum(col3, na.rm = TRUE), .(col4)]

What this would do: while computing the sum over rows, it would not include those rows where both col1 and col2 take the same values as an earlier row. How can I include this condition in the same syntax as 1?
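
For reference, here is a minimal sketch of what syntax 1 returns on the first input as written, so the gap to the expected output is concrete. The table is rebuilt with data.table() instead of the dput for brevity, which is an assumption of this sketch (the factor levels differ from the original, but that does not affect the sums).

```r
library(data.table)

# First input from the question, rebuilt with data.table() (assumption: same values as the dput)
dat <- data.table(
  col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 4.6, 5),
  col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 3.6),
  col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 1.4),
  col4 = factor(rep(c("setosa", "eer"), each = 5))
)

# Syntax 1: grouped sum assigned by reference; every row of a group is counted,
# including rows that repeat the same (col1, col2) pair
dat[, r_new := sum(col3, na.rm = TRUE), by = .(col4)]

# One row per group with its sum
unique(dat[, .(col4, r_new)])
```

With the repeated rows counted twice, the group sums come out as 7.0 for setosa and 11.0 for eer, rather than the 5.6 and 7.6 that are wanted.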

Expected Output:

        col1 col2 col3   col4 r_new
     1:  5.1  3.5  1.4 setosa   5.6
     2:  5.1  3.5  1.4 setosa   5.6
     3:  4.7  3.2  1.3 setosa   5.6
     4:  4.6  3.1  1.5 setosa   5.6
     5:  5.0  3.6  1.4 setosa   5.6
     6:  5.1  3.5  3.4    eer   7.6
     7:  5.1  3.5  3.4    eer   7.6
     8:  4.7  3.2  1.3    eer   7.6
     9:  4.6  3.1  1.5    eer   7.6
    10:  5.0  3.6  1.4    eer   7.6

As you can see in the expected output, for setosa rows 1 and 2 took the same values for col1 and col2, and for eer rows 6 and 7 took the same values for col1 and col2, so we did not add those rows twice (we just considered them once). Don't worry about col3 (it will take the same value if col1 and col2 take the same values).

EDIT: Second dput:

    structure(list(col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 4.6, 5.1),
        col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 3.4),
        col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 3.4),
        col4 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
        count = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
        r_new = c(5.6, 5.6, 5.6, 5.6, 5.6, 9.6, 9.6, 9.6, 9.6, 9.6)),
        .Names = c("col1", "col2", "col3", "col4", "count", "r_new"),
        row.names = c(NA, -10L), class = c("data.table", "data.frame"))

        col1 col2 col3 col4 count r_new
     1:  5.1  3.5  1.4    A     1   5.6
     2:  5.1  3.5  1.4    A     1   5.6
     3:  4.7  3.2  1.3    A     1   5.6
     4:  4.6  3.1  1.5    A     1   5.6
     5:  5.0  3.6  1.4    A     1   5.6
     6:  5.1  3.5  3.4    B     1   9.6
     7:  5.1  3.5  3.4    B     1   9.6
     8:  4.7  3.2  1.3    B     1   9.6
     9:  4.6  3.1  1.5    B     1   9.6
    10:  5.1  3.4  3.4    B     1   9.6

EDIT 2: Third dput:

        col1 col2 col3 col4 count r_new
     1:  5.1  3.5  1.4    A     1   5.6
     2:  5.1  3.5  1.4    A     1   5.6
     3:  4.7  3.2  1.3    A     1   5.6
     4:  4.6  3.1  1.5    A     1   5.6
     5:  5.0  3.6  1.4    A     1   5.6
     6:  5.1  3.5  3.4    B     1   6.2
     7:  5.1  3.5  3.4    B     1   6.2
     8:  4.7  3.2  1.3    B     1   6.2
     9:  4.6  3.1  1.5    B     1   6.2
    10:  5.1  3.5  3.4    B     1   6.2

    structure(list(col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 4.6, 5.1),
        col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 3.5),
        col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 3.4),
        col4 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
        count = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
        r_new = c(5.6, 5.6, 5.6, 5.6, 5.6, 6.2, 6.2, 6.2, 6.2, 6.2)),
        .Names = c("col1", "col2", "col3", "col4", "count", "r_new"),
        row.names = c(NA, -10L), class = c("data.table", "data.frame"))

Solution

We can subset col3 inside j using ?data.table::duplicated, so that within each col4 group only the first row of every (col1, col2) pair contributes to the sum.

    dat[, r_new := sum(col3[!duplicated(.SD, by = c("col1","col2"))], na.rm = TRUE), by = col4]

    > dat
    #    col1 col2 col3 col4 count r_new
    # 1:  5.1  3.5  1.4    A     1   5.6
    # 2:  5.1  3.5  1.4    A     1   5.6
    # 3:  4.7  3.2  1.3    A     1   5.6
    # 4:  4.6  3.1  1.5    A     1   5.6
    # 5:  5.0  3.6  1.4    A     1   5.6
    # 6:  5.1  3.5  3.4    B     1   6.2
    # 7:  5.1  3.5  3.4    B     1   6.2
    # 8:  4.7  3.2  1.3    B     1   6.2
    # 9:  4.6  3.1  1.5    B     1   6.2
    #10:  5.1  3.5  3.4    B     1   6.2
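
As a self-contained check, the sketch below applies the same duplicated() idea to the first input from the question and compares it with an alternative formulation (deduplicate, aggregate, then join back). The data.table() reconstruction of the input and the column name r_alt are assumptions made for illustration; they are not part of the original answer.

```r
library(data.table)

# First input from the question, rebuilt with data.table() (assumption: same values as the dput)
dat <- data.table(
  col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 4.6, 5),
  col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 3.6),
  col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 1.4),
  col4 = factor(rep(c("setosa", "eer"), each = 5))
)

# Approach from the answer: within each col4 group, drop rows whose
# (col1, col2) pair was already seen, then sum the remaining col3 values
dat[, r_new := sum(col3[!duplicated(.SD, by = c("col1", "col2"))], na.rm = TRUE),
    by = col4]

# Alternative sketch (r_alt is a hypothetical column name):
# deduplicate on (col4, col1, col2), aggregate per group, then join the sums back
sums <- unique(dat, by = c("col4", "col1", "col2"))[
  , .(r_alt = sum(col3, na.rm = TRUE)), by = col4]
dat[sums, on = "col4", r_alt := i.r_alt]

# Both columns should agree up to floating-point tolerance
all.equal(dat$r_new, dat$r_alt)
```

On this input both columns come out as 5.6 for setosa and 7.6 for eer, matching the expected output in the question. The in-j version keeps everything in one call; the join version may be easier to read when several aggregates are needed from the deduplicated rows.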