在data.table中添加行,而不是在某些列使用相同的值时 [英] add rows in a data.table but not when certain columns take same values

查看:101
本文介绍了在data.table中添加行,而不是在某些列使用相同的值时的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个data.table dat 有4列,说( col1 col2 col3 col4 )。



输入数据:

 结构5.1,5.1,4.7,4.6,5,5.1,5.1,4.7,
4.6,5),col2 = c(3.5,3.5,3.2,3.1,3.6,3.5,3.5,3.2,3.1,
4),col4 =结构(c(1L,1L,1L,1L,1L,4L)),col3 = c(1.4,1.4,1.3,1.5,1.4,3.4,3.4,1.3,1.5,1.4, ,4L,4L,4L,4L),.Label = c(setosa,
versicolor,virginica,eer),class =factor)).Names = c col1,
col2,col3,col4),row.names = c(NA,-10L),class = c(data.table,
data.frame ))

r
col1 col2 col3 col4
1:5.1 3.5 1.4 setosa
2:5.1 3.5 1.4 setosa
3:4.7 3.2 1.3 setosa
4:4.6 3.1 1.5 setosa
5:5.0 3.6 1.4 setosa
6:5.1 3.5 3.4 eer
7:5.1 3.5 3.4 eer
8:4.7 3.2 1.3 eer
9:4.6 3.1 1.5 eer
10:5.0 3.6 1.4 eer

col4

的每个唯一值对 col3 执行以下操作

  dat [,r_new:= sum(col3,na.rm = T),。(col4)] #syntax 1 
pre>

因此,上面sytnax正在创建一个新列 r_new $ c> col3 其中 col4 是相同的。因此, col4 的每个唯一值在列 r_new 中将具有不正确的值。



现在我想做的是做与上面相同的操作,但包括 col1 col2 正在使用相同的值(如下所示)

  dat [col1 is different OR col2 is different,r_new:= sum(col3,na.rm = T),。(col4)] 

这将做什么,在执行 sum 函数的行,它不会包括 col1 col2 使用相同的值。



语法与1相同?



预期输出:

  col1 col2 col3 col4 r_new 
1:5.1 3.5 1.4 setosa 5.6
2:5.1 3.5 1.4 setosa 5.6
3:4.7 3.2 1.3 setosa 5.6
4:4.6 3.1 1.5 setosa 5.6
5:5.0 3.6 1.4 setosa 5.6
6:5.1 3.5 3.4 eer 7.6
7:5.1 3.5 3.4 eer 7.6
8:4.7 3.2 1.3 eer 7.6
9:4.6 3.1 1.5 eer 7.6
10:5.0 3.6 1.4 eer 7.6

正如您在预期输出中看到的, setosa 第1行和第2行对 col1 col2 err 行6和7对 col1 col2 ,所以我们没有添加这些行(我们只是认为他们一次)。不要担心 col3 (如果 col1 col2 正在使用相同的值。



EDIT:第二个输入

 结构(列表(col1 = c(5.1,5.1,4.7,4.6,5,5.1,5.1,4.7,
4.6,5.1),col2 = c(3.5, 3.2,3.1,3.6,3.5,3.5,3.2,3.1,
3.4),col3 = c(1.4,1.4,1.3,1.5,1.4,3.4,3.4,1.3,1.5,3.4,b $ b), col4 = c(A,A,A,A,A,B,B,B,B,B = c(1,1,1,1,1,1,1,1,1,1,1),r_new = c(5.6,5.6,
5.6,5.6,5.6,9.6,9.6,9.6,9.6, 9.6)),.Names = c(col1,
col2,col3,col4,count,r_new),row.names = c(NA,
-10L),class = c(data.table,data.frame))

col1 col2 col3 col4计数r_new
1:5.1 3.5 1.4 A 1 5.6
2:5.1 3.5 1.4 A 1 5.6
3:4.7 3.2 1.3 A 1 5.6
4:4.6 3.1 1.5 A 1 5.6
5:5.0 3.6 1.4 A 1 5.6
6 :5.1 3.5 3.4 B 1 9.6
7:5.1 3.5 3.4 B 1 9.6
8:4.7 3.2 1.3 B 1 9.6
9:4.6 3.1 1.5 B 1 9.6
10:5.1 3.4 3.4 B 1 9.6

EDIT 2:第三次输入

  col1 col2 col3 col4计数r_new 
1:5.1 3.5 1.4 A 1 5.6
2:5.1 3.5 1.4 A 1 5.6
3:4.7 3.2 1.3 A 1 5.6
4:4.6 3.1 1.5 A 1 5.6
5:5.0 3.6 1.4 A 1 5.6
6:5.1 3.5 3.4 B 1 6.2
7:5.1 3.5 3.4 B 1 6.2
8:4.7 3.2 1.3 B 1 6.2
9:4.6 3.1 1.5 B 1 6.2
10:5.1 3.5 3.4 B 1 6.2


结构(列表(col1 = c(5.1,5.1,4.7,4.6,5,5.1,5.1,4.7,
4.6,5.1),col2 = c(3.5,3.5,3.2 ,3.1,3.6,3.5,3.5,3.2,3.1,
3.5),col3 = c(1.4,1.4,1.3,1.5,1.4,3.4,3.4,1.3,1.5,3.4b $ b),col4 = c(A,A,A,A,A,B,B,B,B,B c(1,1,1,1,1,1,1,1,1,1,1),r_new = c(5.6,5.6,
5.6,5.6,5.6,6.2,6.2,6.2,6.2,6.2 )),.Names = c(col1,
col2,col3,col4,count,r_new),row.names = c(NA,
- 10L),class = c(data.table,data.frame))


解决方案

我们可以使用?数据子集 col3 .table :: duplicate

  dat [,r_new:= sum(col3 [!duplicated .SD,by = c(col1,col2))],na.rm = T),by = col4] 

> dat
#col1 col2 col3 col4 count r_new
#1:5.1 3.5 1.4 A 1 5.6
#2:5.1 3.5 1.4 A 1 5.6
#3:4.7 3.2 1.3 A 1 5.6
#4:4.6 3.1 1.5 A 1 5.6
#5:5.0 3.6 1.4 A 1 5.6
#6:5.1 3.5 3.4 B 1 6.2
#7:5.1 3.5 3.4 B 1 6.2
#8:4.7 3.2 1.3 B 1 6.2
#9:4.6 3.1 1.5 B 1 6.2
#10:5.1 3.5 3.4 B 1 6.2


I have a data.table dat with 4 columns, say (col1, col2, col3, col4).

Input data:

structure(list(col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 
4.6, 5), col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 
3.6), col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 1.4
), col4 = structure(c(1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L), .Label = c("setosa", 
"versicolor", "virginica", "eer"), class = "factor")), .Names = c("col1", 
"col2", "col3", "col4"), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))

r
    col1 col2 col3   col4
 1:  5.1  3.5  1.4 setosa
 2:  5.1  3.5  1.4 setosa
 3:  4.7  3.2  1.3 setosa
 4:  4.6  3.1  1.5 setosa
 5:  5.0  3.6  1.4 setosa
 6:  5.1  3.5  3.4    eer
 7:  5.1  3.5  3.4    eer
 8:  4.7  3.2  1.3    eer
 9:  4.6  3.1  1.5    eer
10:  5.0  3.6  1.4    eer

I am performing a following operation on col3 for each unique value of col4

dat[ , r_new:= sum(col3, na.rm = T), .(col4)]    #syntax 1

So, above sytnax is creating a new column r_new with values got by adding those values of col3 where col4 is same. So, each unique value of col4 will have a unuique value in column r_new.

What I want to do now, is do the same as above but not include those rows where col1 and col2 are taking same values (something like below)

dat[col1 is different OR col2 is different , r_new:= sum(col3, na.rm = T), .(col4)]

What this will do, while performing sum function over rows, it will not include those rows where both col1 and col2 are taking same values.

How can I include this condition in the same syntax as 1?

Expected Output:

    col1 col2 col3   col4 r_new
 1:  5.1  3.5  1.4 setosa   5.6
 2:  5.1  3.5  1.4 setosa   5.6
 3:  4.7  3.2  1.3 setosa   5.6
 4:  4.6  3.1  1.5 setosa   5.6
 5:  5.0  3.6  1.4 setosa   5.6
 6:  5.1  3.5  3.4    eer   7.6
 7:  5.1  3.5  3.4    eer   7.6
 8:  4.7  3.2  1.3    eer   7.6
 9:  4.6  3.1  1.5    eer   7.6
10:  5.0  3.6  1.4    eer   7.6

As you can see in the expected output, for setosa row 1 and 2 took same value for col1 and col2 and for err rows 6 and 7 took same values for col1 and col2, so we did not add those rows (we just considered them once). Dont worry about col3 (it will take same value if col1 and col2 are taking same values.

EDIT: Second dput:

structure(list(col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 
4.6, 5.1), col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 
3.4), col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 3.4
), col4 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"), 
    count = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), r_new = c(5.6, 5.6, 
    5.6, 5.6, 5.6, 9.6, 9.6, 9.6, 9.6, 9.6)), .Names = c("col1", 
"col2", "col3", "col4", "count", "r_new"), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

    col1 col2 col3 col4 count r_new
 1:  5.1  3.5  1.4    A     1   5.6
 2:  5.1  3.5  1.4    A     1   5.6
 3:  4.7  3.2  1.3    A     1   5.6
 4:  4.6  3.1  1.5    A     1   5.6
 5:  5.0  3.6  1.4    A     1   5.6
 6:  5.1  3.5  3.4    B     1   9.6
 7:  5.1  3.5  3.4    B     1   9.6
 8:  4.7  3.2  1.3    B     1   9.6
 9:  4.6  3.1  1.5    B     1   9.6
10:  5.1  3.4  3.4    B     1   9.6

EDIT 2: Third dput

   col1 col2 col3 col4 count r_new
 1:  5.1  3.5  1.4    A     1   5.6
 2:  5.1  3.5  1.4    A     1   5.6
 3:  4.7  3.2  1.3    A     1   5.6
 4:  4.6  3.1  1.5    A     1   5.6
 5:  5.0  3.6  1.4    A     1   5.6
 6:  5.1  3.5  3.4    B     1   6.2
 7:  5.1  3.5  3.4    B     1   6.2
 8:  4.7  3.2  1.3    B     1   6.2
 9:  4.6  3.1  1.5    B     1   6.2
10:  5.1  3.5  3.4    B     1   6.2


structure(list(col1 = c(5.1, 5.1, 4.7, 4.6, 5, 5.1, 5.1, 4.7, 
4.6, 5.1), col2 = c(3.5, 3.5, 3.2, 3.1, 3.6, 3.5, 3.5, 3.2, 3.1, 
3.5), col3 = c(1.4, 1.4, 1.3, 1.5, 1.4, 3.4, 3.4, 1.3, 1.5, 3.4
), col4 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"), 
    count = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), r_new = c(5.6, 5.6, 
    5.6, 5.6, 5.6, 6.2, 6.2, 6.2, 6.2, 6.2)), .Names = c("col1", 
"col2", "col3", "col4", "count", "r_new"), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

解决方案

We can subset col3 inside j using ?data.table::duplicated.

dat[, r_new := sum(col3[!duplicated(.SD, by = c("col1","col2"))], na.rm = T), by = col4]  

> dat
#      col1 col2 col3 col4 count r_new
# 1:  5.1  3.5  1.4    A     1   5.6
# 2:  5.1  3.5  1.4    A     1   5.6
# 3:  4.7  3.2  1.3    A     1   5.6
# 4:  4.6  3.1  1.5    A     1   5.6
# 5:  5.0  3.6  1.4    A     1   5.6
# 6:  5.1  3.5  3.4    B     1   6.2
# 7:  5.1  3.5  3.4    B     1   6.2
# 8:  4.7  3.2  1.3    B     1   6.2
# 9:  4.6  3.1  1.5    B     1   6.2
#10:  5.1  3.5  3.4    B     1   6.2

这篇关于在data.table中添加行,而不是在某些列使用相同的值时的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆