Calculate the mean between several columns of df2 that can vary according to the variable `var1` of df1 and add the value to a new variable in df1

Problem description

I have a data frame df1 that summarises the depth of different fishes over time at different places.

On the other hand, I have df2 that summarises the intensity of the currents over time (EVERY THREE HOURS) from the surface to 39 meters depth at intervals of 8 meters (m0-7, m8-15, m16-23, m24-31 and m32-39) in a specific place. As an example:

df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")


> df1
             Datetime Site Ind Depth
1 2016-08-01 15:34:07   BD  16   5.3
2 2016-08-01 16:25:16   HG  17  24.0
3 2016-08-01 17:29:16   BD  19  36.4
4 2016-08-01 18:33:16   BD  16  42.0
5 2016-08-01 20:54:16   BD  17    NA
6 2016-08-01 22:48:16   BD  16  22.1

df2<-data.frame(Datetime=c("2016-08-01 12:00:00","2016-08-01 15:00:00","2016-08-01 18:00:00","2016-08-01 21:00:00","2016-08-02 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

> df2
             Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39
1 2016-08-01 12:00:00   BD 2.75  3.00   2.75   3.25   3.00
2 2016-08-01 15:00:00   BD 4.00  4.00   4.00   3.00   4.00
3 2016-08-01 18:00:00   BD 6.75  4.75   5.75   6.50   4.75
4 2016-08-01 21:00:00   BD 2.25  3.00   2.25   2.75   3.00
5 2016-08-02 00:00:00   BD 4.30  2.10   1.40   3.40   1.70

I want to create a variable in df1 that reflects the mean current for the depth layers in which the fish WASN'T. For instance, if a fish is at 20 meters depth, which corresponds to layer m16-23, I want to know the mean current for layers m0-7, m8-15, m24-31 and m32-39.

Note 1: if a fish is deeper than 39 meters, I treat it as if it were in the deepest layer (m32-39). Row 4 of df1 is an example of this.

Note 2: since the currents are recorded every three hours, each time in df2$Datetime is the mid-point of a window that extends one and a half hours before and after it. That is, the current intensity recorded in df2 at 21:00:00 reflects the currents between 19:30:00 and 22:30:00. The same applies to the other times.
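As a minimal sketch of this windowing rule in base R, an observation time can be snapped to the nearest 3-hour record by rounding. The helper name `nearest_3h` is hypothetical, and the sketch assumes the records sit on a 3-hour grid aligned to 00:00 UTC, as they do in the example df2. How an observation exactly on a window boundary (for instance 19:30:00) should be assigned is not stated in the question and is not handled specially here.

```r
# Hypothetical helper: snap a POSIXct time to the nearest 3-hour grid point.
# Assumption: the grid is aligned to 00:00 UTC, as in the example df2.
nearest_3h <- function(x) {
  step <- 3 * 3600  # width of one window in seconds
  as.POSIXct(round(as.numeric(x) / step) * step,
             origin = "1970-01-01", tz = "UTC")
}

obs <- as.POSIXct(c("2016-08-01 20:54:16", "2016-08-01 22:48:16"), tz = "UTC")
snapped <- nearest_3h(obs)
print(format(snapped, "%Y-%m-%d %H:%M:%S"))
# "2016-08-01 21:00:00" "2016-08-02 00:00:00"
```

The second observation falls into the window of the next day's 00:00:00 record, which is why row 6 of the expected output uses the last row of df2.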

I would expect this:

> df1
             Datetime Site Ind Depth current.Mean
1 2016-08-01 15:34:07   BD  16   5.3         3.75
2 2016-08-01 16:25:16   HG  17  24.0           NA
3 2016-08-01 17:29:16   BD  19  36.4         5.94
4 2016-08-01 18:33:16   BD  16  42.0         5.94
5 2016-08-01 20:54:16   BD  17    NA           NA
6 2016-08-01 22:48:16   BD  16  22.1         2.87

Does anyone know how to do this?

Recommended answer

This question poses two interesting challenges:


  1. The OP is asking for a "partial anti-join", i.e., the OP wants to aggregate current data in df2 where Datetime and Site are matching but the depth layer does not.
  2. The current data df2 are given in a look-up table where each value is associated with a depth range (depth layer) and a time range of 3 hours. So, the measured Depth and Datetime in df1 need to be mapped onto the respective ranges.

I have tried different approaches but I ended up with the one below which does not make assumptions about the aggregate function. So, mean() can be called directly.

library(data.table)
library(magrittr)

# reshape df2 from wide to long format
currents <- melt(setDT(df2), id.vars = c("Datetime", "Site"),
                 variable.name = "layer", value.name = "current")

# create columns to join on
labels <- names(df2) %>% stringr::str_subset("^m")
breaks <- c(seq(0, 32, 8), Inf)
setDT(df1)[, layer := cut(Depth, breaks = breaks, labels = labels)]
df1[, current.dt := df2[df1, on = .(Site, Datetime), 
                      roll = "nearest", x.Datetime]]

# "partial anti-join" to compute mean of other layers
currents_other_layers <- 
  currents[df1, on = .(Site, Datetime = current.dt)][
    layer != i.layer, mean(current), by = .(i.Datetime, Site)]

# append result column
df1[currents_other_layers, on = .(Site, Datetime = i.Datetime), current.mean := i.V1]
df1




               Datetime Site Ind Depth  layer          current.dt current.mean
1: 2016-08-01 15:34:07   BD  16   5.3   m0-7 2016-08-01 15:00:00       3.7500
2: 2016-08-01 16:25:16   HG  17  24.0 m16-23                <NA>           NA
3: 2016-08-01 17:29:16   BD  19  36.4 m32-39 2016-08-01 18:00:00       5.9375
4: 2016-08-01 18:33:16   BD  16  42.0 m32-39 2016-08-01 18:00:00       5.9375
5: 2016-08-01 20:54:16   BD  17    NA   <NA> 2016-08-01 21:00:00           NA
6: 2016-08-01 22:48:16   BD  16  22.1 m16-23 2016-08-02 00:00:00       2.8750


This reproduces OP's expected result.

df2 is reshaped from wide to long format. This allows for joining / anti-joining on the layer column.

currents




               Datetime Site  layer current
 1: 2016-08-01 12:00:00   BD   m0-7    2.75
 2: 2016-08-01 15:00:00   BD   m0-7    4.00
 3: 2016-08-01 18:00:00   BD   m0-7    6.75
 4: 2016-08-01 21:00:00   BD   m0-7    2.25
 5: 2016-08-02 00:00:00   BD   m0-7    4.30
 6: 2016-08-01 12:00:00   BD  m8-15    3.00
 7: 2016-08-01 15:00:00   BD  m8-15    4.00
 8: 2016-08-01 18:00:00   BD  m8-15    4.75
 9: 2016-08-01 21:00:00   BD  m8-15    3.00
10: 2016-08-02 00:00:00   BD  m8-15    2.10
11: 2016-08-01 12:00:00   BD m16-23    2.75
12: 2016-08-01 15:00:00   BD m16-23    4.00
13: 2016-08-01 18:00:00   BD m16-23    5.75
14: 2016-08-01 21:00:00   BD m16-23    2.25
15: 2016-08-02 00:00:00   BD m16-23    1.40
16: 2016-08-01 12:00:00   BD m24-31    3.25
17: 2016-08-01 15:00:00   BD m24-31    3.00
18: 2016-08-01 18:00:00   BD m24-31    6.50
19: 2016-08-01 21:00:00   BD m24-31    2.75
20: 2016-08-02 00:00:00   BD m24-31    3.40
21: 2016-08-01 12:00:00   BD m32-39    3.00
22: 2016-08-01 15:00:00   BD m32-39    4.00
23: 2016-08-01 18:00:00   BD m32-39    4.75
24: 2016-08-01 21:00:00   BD m32-39    3.00
25: 2016-08-02 00:00:00   BD m32-39    1.70
               Datetime Site  layer current


Now, df1 has to be amended to include columns which correspond to layer and Datetime in currents.

For Depth, the cut() function is used. The last layer, m32-39, is extended to Inf so that all depths greater than 32 m fall into this level, as requested by the OP.
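The binning step can be tried in isolation with the depths from df1. This is only a sketch: the variable names `layer_default` and `layer_left` are illustrative, and the `right = FALSE` variant is an assumption about what the layer names intend, not part of the answer's code. With the default `right = TRUE` the intervals are right-closed, so a depth of exactly 24 lands in m16-23 (as in the output above), whereas left-closed intervals would put it in m24-31.

```r
labels <- c("m0-7", "m8-15", "m16-23", "m24-31", "m32-39")
breaks <- c(seq(0, 32, 8), Inf)          # 0, 8, 16, 24, 32, Inf
depth  <- c(5.3, 24, 36.4, 42, NA, 22.1) # Depth column of df1

# Default: intervals are (0,8], (8,16], ... so 24 belongs to m16-23.
layer_default <- cut(depth, breaks = breaks, labels = labels)

# Assumed alternative: intervals are [0,8), [8,16), ... so 24 belongs to m24-31.
layer_left <- cut(depth, breaks = breaks, labels = labels, right = FALSE)

print(data.frame(depth, layer_default, layer_left))
```

For the example data the choice only affects the HG row, which has no matching currents anyway, but it may matter for real data with depths on a layer boundary or at exactly 0 m.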

For Datetime, a rolling join to the nearest Datetime in df2 is used. This is possible because df2$Datetime denotes the mid-point of each 3-hour time range.
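A reduced sketch of the rolling join, using the same call shape as the code above but on two small illustrative tables (`current_times` and `obs` are made-up names, and only three of the df2 time stamps are included):

```r
library(data.table)

# Time stamps at which currents were recorded (subset of df2, site BD only).
current_times <- data.table(
  Site = "BD",
  Datetime = as.POSIXct(c("2016-08-01 18:00:00", "2016-08-01 21:00:00",
                          "2016-08-02 00:00:00"), tz = "UTC")
)

# Two fish observations: one at BD, one at a site without current data.
obs <- data.table(
  Site = c("BD", "HG"),
  Datetime = as.POSIXct(c("2016-08-01 22:48:16", "2016-08-01 16:25:16"),
                        tz = "UTC")
)

# Site must match exactly; Datetime (the last join column) rolls to the
# nearest recorded time. x.Datetime returns the matched time from current_times.
matched <- current_times[obs, x.Datetime,
                         on = .(Site, Datetime), roll = "nearest"]
print(matched)
```

The BD observation is matched to 2016-08-02 00:00:00 and the HG observation yields NA, because the exact match on Site is applied before the roll on Datetime.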

After df1 has been prepared, we can do the "partial anti-join". Unfortunately, data.table's non-equi joins do not accept the != operator. So, we cannot write

currents[df1, on = .(Datetime = current.dt, Site, layer != layer)]

directly but have to use a work-around where we first pick the rows where we expect matches and then do an anti-join:

 currents[df1, on = .(Datetime = current.dt, Site)][
    !df1, on = .(Datetime = current.dt, Site, layer)]




               Datetime Site  layer current          i.Datetime Ind Depth i.layer
 1: 2016-08-01 15:00:00   BD  m8-15    4.00 2016-08-01 15:34:07  16   5.3    m0-7
 2: 2016-08-01 15:00:00   BD m16-23    4.00 2016-08-01 15:34:07  16   5.3    m0-7
 3: 2016-08-01 15:00:00   BD m24-31    3.00 2016-08-01 15:34:07  16   5.3    m0-7
 4: 2016-08-01 15:00:00   BD m32-39    4.00 2016-08-01 15:34:07  16   5.3    m0-7
 5: 2016-08-01 18:00:00   BD   m0-7    6.75 2016-08-01 17:29:16  19  36.4  m32-39
 6: 2016-08-01 18:00:00   BD  m8-15    4.75 2016-08-01 17:29:16  19  36.4  m32-39
 7: 2016-08-01 18:00:00   BD m16-23    5.75 2016-08-01 17:29:16  19  36.4  m32-39
 8: 2016-08-01 18:00:00   BD m24-31    6.50 2016-08-01 17:29:16  19  36.4  m32-39
 9: 2016-08-01 18:00:00   BD   m0-7    6.75 2016-08-01 18:33:16  16  42.0  m32-39
10: 2016-08-01 18:00:00   BD  m8-15    4.75 2016-08-01 18:33:16  16  42.0  m32-39
11: 2016-08-01 18:00:00   BD m16-23    5.75 2016-08-01 18:33:16  16  42.0  m32-39
12: 2016-08-01 18:00:00   BD m24-31    6.50 2016-08-01 18:33:16  16  42.0  m32-39
13: 2016-08-01 21:00:00   BD   m0-7    2.25 2016-08-01 20:54:16  17    NA    <NA>
14: 2016-08-01 21:00:00   BD  m8-15    3.00 2016-08-01 20:54:16  17    NA    <NA>
15: 2016-08-01 21:00:00   BD m16-23    2.25 2016-08-01 20:54:16  17    NA    <NA>
16: 2016-08-01 21:00:00   BD m24-31    2.75 2016-08-01 20:54:16  17    NA    <NA>
17: 2016-08-01 21:00:00   BD m32-39    3.00 2016-08-01 20:54:16  17    NA    <NA>
18: 2016-08-02 00:00:00   BD   m0-7    4.30 2016-08-01 22:48:16  16  22.1  m16-23
19: 2016-08-02 00:00:00   BD  m8-15    2.10 2016-08-01 22:48:16  16  22.1  m16-23
20: 2016-08-02 00:00:00   BD m24-31    3.40 2016-08-01 22:48:16  16  22.1  m16-23
21: 2016-08-02 00:00:00   BD m32-39    1.70 2016-08-01 22:48:16  16  22.1  m16-23
22:                <NA>   HG   <NA>      NA 2016-08-01 16:25:16  17  24.0  m16-23
               Datetime Site  layer current          i.Datetime Ind Depth i.layer
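The anti-join idiom `X[!Y, on = ...]` used in the second step can be seen in isolation in the following sketch. The tables `candidates` and `occupied` and their columns are invented for illustration; `id` stands in for the Datetime/Site pair.

```r
library(data.table)

# All candidate (id, layer) rows with a current value.
candidates <- data.table(
  id      = c(1, 1, 1, 2, 2, 2),
  layer   = rep(c("a", "b", "c"), times = 2),
  current = c(4, 4, 3, 6.75, 4.75, 5.75)
)

# The layer that is occupied for each id and must be excluded.
occupied <- data.table(id = c(1, 2), layer = c("a", "c"))

# Anti-join: keep only the rows of candidates that have no match in occupied.
others <- candidates[!occupied, on = .(id, layer)]

# Any aggregation function can then be applied per id.
res <- others[, .(mean_current = mean(current)), by = id]
print(res)
```

Here id 1 keeps layers b and c (mean 3.5) and id 2 keeps layers a and b (mean 5.75).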


This can be aggregated as desired by an arbitrary aggregation function (no need to manually add single columns selectively):

currents_other_layers <- 
  currents[df1, on = .(Datetime = current.dt, Site)][
    !df1, on = .(Datetime = current.dt, Site, layer)][
      !is.na(Depth), mean(current), by = .(i.Datetime, Site)]

currents_other_layers




            i.Datetime Site     V1
1: 2016-08-01 15:34:07   BD 3.7500
2: 2016-08-01 17:29:16   BD 5.9375
3: 2016-08-01 18:33:16   BD 5.9375
4: 2016-08-01 22:48:16   BD 2.8750
5: 2016-08-01 16:25:16   HG     NA


This result contains the mean current of all layers except the one in which the fish was observed. Note that grouping is by i.Datetime, which refers to df1$Datetime, and by Site. Rows where Depth is missing in df1 are omitted to meet the OP's expected result.
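The three distinct means in this result can be cross-checked by hand in base R. The sketch below copies the relevant rows of df2 into a matrix; `profiles` and `mean_other_layers` are hypothetical names introduced only for this check.

```r
# Current profiles at three df2 time stamps (values copied from df2).
profiles <- rbind(
  "15:00" = c(4.00, 4.00, 4.00, 3.00, 4.00),
  "18:00" = c(6.75, 4.75, 5.75, 6.50, 4.75),
  "00:00" = c(4.30, 2.10, 1.40, 3.40, 1.70)
)
colnames(profiles) <- c("m0-7", "m8-15", "m16-23", "m24-31", "m32-39")

# Hypothetical helper: mean of one profile without the occupied layer.
mean_other_layers <- function(profile, occupied) {
  mean(profile[names(profile) != occupied])
}

checks <- c(
  mean_other_layers(profiles["15:00", ], "m0-7"),   # fish at 5.3 m
  mean_other_layers(profiles["18:00", ], "m32-39"), # fish at 36.4 m and 42 m
  mean_other_layers(profiles["00:00", ], "m16-23")  # fish at 22.1 m
)
print(checks)
# 3.7500 5.9375 2.8750
```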

A final update join appends the result column to df1.
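As a small sketch of what an update join does, the following uses two invented tables (`obs` and `res`) with a single key column `id` in place of Site and Datetime. The column is added to `obs` by reference, and rows without a match are left as NA.

```r
library(data.table)

obs <- data.table(id = 1:3, depth = c(5.3, 24, 36.4))
res <- data.table(id = c(1L, 3L), V1 = c(3.75, 5.9375))

# Update join: for every row of res, find the matching row in obs
# and assign V1 (referred to as i.V1) to a new column.
obs[res, on = .(id), current.mean := i.V1]
print(obs)
```

The row with id 2 has no counterpart in `res`, so its `current.mean` stays NA, which mirrors the HG row and the row with missing Depth in df1.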
