R DataTable连接和约束行 [英] R DataTable Join and constrain rows
问题描述
我想在一个数据表中总结一组观察结果,并且可以使用一些语法帮助。
我认为这和连接一样简单,我试图确定具体的观察日期,即使在当天的多个测量或传感器,看到具体的值。
- 观察日期具有不同的度量计数(每日行数)
- 'M'easurement列表示特定值
我已经创建了2个样本数据集,我希望这些数据集能够明确目标。我还创建了一个excel电子表格的图像,希望显示数据之间的关系。
library(data.table )
raw< - data.table(
Date = as.Date(c(2013-5-4,2013-5-4,2013-5-4 2013-5-9,2013-5-9,2013-5-16,2013-5-16,2013-5-16,2013-5-30)),
S1 = c(4,2,3,1,1,8,7,3,3),
S2 = c(2,5,2,4,4,9,1,6, 4),
S3 = c(6,2,2,7,3,2,7,2,1)
)
总结< - data.table
Date = as.Date(c(2013-5-4,2013-5-9,2013-5-16,2013-5-30)),
M1 = c(FALSE,TRUE,TRUE,TRUE),
M2 = c(TRUE,FALSE,TRUE,FALSE),
M3 = c b M4 = c(TRUE,FALSE,FALSE,TRUE),
M5 = c(TRUE,FALSE,FALSE,FALSE),
M6 = c(TRUE,FALSE,TRUE,FALSE) $ b M7 = c(FALSE,TRUE,TRUE,FALSE),
M8 = c(FALSE,FALSE,TRUE,FALSE),
M9 = c(FALSE,FALSE,TRUE,FALSE) b $ b M10 = c(FALSE,FALSE,TRUE,FALSE)
)
img src =https://i.stack.imgur.com/bydHL.jpgalt =Excel图像显示数据集之间的关系>
Excel
Raw是测量输入。多次测量可在同一观察日期(即多行)进行。
总结是我希望得到的结果。总结行,并且'm'测量列仅指示在任何V列中的当天观察到值(跟随M,即M1,M2)。例如,在5/16的第一次和最后一次观察中看到数字2,但是在5/16的9个值中没有出现数字5.
我想我需要使用一个联接,但是如何计算M列逃避我。
任何帮助是非常感激。
问题:在数据科学或数学中是否有这种类型的操作的名称?
更新:
我正在尝试下列
setkey(raw,Date)
s< - data.table unique(raw $ Date))#获取唯一日期的数据类型
setkey(s,Date)
s [raw,M1:=(length(na.omit请注意,如果您要查看原始档案,请按照下列步骤操作:
,
,
,
,这些值不是5-4的预期值(应为FALSE)。我认为这是因为原始行不在我的匹配语句中约束。 日期M1
1:2013 -05-04 TRUE
2:2013-05-09 TRUE
3:2013-05-16 TRUE
4:2013-05-30 TRUE
我的猜测是我需要使用不同的东西来连接原始行。
解决方案这似乎有效:
raw [,lapply(1:10,`%in%`,unique(unlist(.SD))),by = Date]
>
结果是
日期V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1:2013-05-04假真假真假真假假假假
2:2013-05-09真假真假真假假真假假假
3:2013- 05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
4:2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
如果您希望使用M而不是V的列,请使用 c(M = 1:10)
代替 1:10
。
I'd like to summarize a set of observations in a datatable and could use some help with the syntax.
I think this is as simple as a join but I'm trying to identify that specific values were seen on a specific observation DAY even if its across multiple measurements or sensors on that day.
- observations are summarized by date
- observations date have varied counts of measurements (rows per date)
- 'M'easurement columns indicate that a specific value was observed in ANY sensor for the day.
I've created 2 sample sets of data that I hope will clarify the goal. I've also created an image of an excel spreadsheet that hopes to show the relationship between the data.
library(data.table)
raw <- data.table(
Date = as.Date(c("2013-5-4","2013-5-4","2013-5-4", "2013-5-9","2013-5-9", "2013-5-16","2013-5-16","2013-5-16", "2013-5-30")),
S1 = c(4, 2, 3, 1, 1, 8, 7, 3, 3),
S2 = c(2, 5, 2, 4, 4, 9, 1, 6, 4),
S3 = c(6, 2, 2, 7, 3, 2, 7, 2, 1)
)
summarized <- data.table(
Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
M1 = c(FALSE,TRUE,TRUE,TRUE),
M2 = c(TRUE,FALSE,TRUE,FALSE),
M3 = c(TRUE,TRUE,TRUE,TRUE),
M4 = c(TRUE,FALSE,FALSE,TRUE),
M5 = c(TRUE,FALSE,FALSE,FALSE),
M6 = c(TRUE,FALSE,TRUE,FALSE),
M7 = c(FALSE,TRUE,TRUE,FALSE),
M8 = c(FALSE,FALSE,TRUE,FALSE),
M9 = c(FALSE,FALSE,TRUE,FALSE),
M10 = c(FALSE,FALSE,TRUE,FALSE)
)
Excel
Raw is the measurements input. Multiple measurements can happen on the same observation date (i.e. multiple rows).
Summarized is what I'm hoping to get out. Rows are summarized and the 'm'easurement columns merely indicate that the value (following the M, i.e. M1, M2) was observed on the day in any of the V columns. For example, the number 2 was seen on the first and last observation on 5/16, but the number 5 was not seen in any of the 9 values on 5/16.
I think I need to use a join but how to calculate the M columns escapes me.
Any help is much appreciated.
Question: is there a name for this type of operation in data science or mathematics?
Update:
I'm trying the following
setkey(raw,Date)
s <- data.table( Date=unique(raw$Date)) # get a datatable of the unique dates
setkey(s,Date)
s[raw, M1:=(length(na.omit(match(c(raw$V1,raw$v2,raw$v3),1)))>=1)]
Note that the values are not what's expected for 5-4 (should be FALSE). I think this is becuase the raw rows are not being constrained in my match statement.
Date M1
1: 2013-05-04 TRUE
2: 2013-05-09 TRUE
3: 2013-05-16 TRUE
4: 2013-05-30 TRUE
My guess is I need to use something different to subset the raw rows in the join.
解决方案 This seems to work:
raw[,lapply(1:10,`%in%`,unique(unlist(.SD))),by=Date]
The result is
Date V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 2013-05-04 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
2: 2013-05-09 TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
3: 2013-05-16 TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
4: 2013-05-30 TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
If you want the columns with "M" instead of "V", use c(M=1:10)
in place of 1:10
.
这篇关于R DataTable连接和约束行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!