How to order the ties in data so that the previously observed value appears first


Problem description

ex <- structure(list(group = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 
2, 2, 2, 2, 2), timestamp = structure(c(1504975114, 1504975115, 
1504975116, 1504975116, 1504975121, 1504975121, 1504975121, 1504975121, 
1504963482, 1504963486, 1504963486, 1504964343, 1504964343, 1504964394, 
1504964394, 1504964394, 1504964394), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), subgroup = c(36L, 36L, 36L, 35L, 36L, 35L, 
35L, 36L, 43L, 43L, 14L, 14L, 14L, 14L, 14L, 43L, 43L), A = c(1L, 
49L, 1L, 74L, 12L, 61L, 5L, 5L, 1L, 30L, 30L, 18L, 19L, 32L, 
40L, 32L, 40L), B = c(1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("group", "timestamp", 
"subgroup", "A", "B"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -17L))

I have data like the above. I want to sort the data within each group by timestamp, while also controlling how ties in timestamp are handled. Specifically, if two observations have the same timestamp, the one that should come first is the one with the same subgroup id as the observation at the previous timestamp. The desired output looks like this:

    # A tibble: 17 x 5
    group timestamp           subgroup     A     B
    <dbl> <dttm>                 <int> <int> <int>
 1  1.00 2017-09-09 16:38:34       36     1     1
 2  1.00 2017-09-09 16:38:35       36    49     1
 3  1.00 2017-09-09 16:38:36       36     1     0
 4  1.00 2017-09-09 16:38:36       35    74     1
 5  1.00 2017-09-09 16:38:41       35    61     1
 6  1.00 2017-09-09 16:38:41       35     5     0
 7  1.00 2017-09-09 16:38:41       36    12     1
 8  1.00 2017-09-09 16:38:41       36     5     1
 9  2.00 2017-09-09 13:24:42       43     1     1
10  2.00 2017-09-09 13:24:46       43    30     1
11  2.00 2017-09-09 13:24:46       14    30     1
12  2.00 2017-09-09 13:39:03       14    18     1
13  2.00 2017-09-09 13:39:03       14    19     1
14  2.00 2017-09-09 13:39:54       14    32     1
15  2.00 2017-09-09 13:39:54       14    40     1
16  2.00 2017-09-09 13:39:54       43    32     1
17  2.00 2017-09-09 13:39:54       43    40     1

How can I do this?

Solution

Here's an idea using the tidyverse:

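library(tidyverse)
ex %>%
  group_by(group) %>%
  mutate(order = map2(
    # subgroups observed at each timestamp, in row order
    split_ <- split(subgroup, timestamp),
    # preferred subgroup order per timestamp, carried over from the previous one
    accumulate(split_, ~intersect(c(rev(.x), .y), .y)),
    # rank of each row's subgroup within that preferred order
    match) %>% unlist()) %>%
  arrange(group, timestamp, order)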

# # A tibble: 17 x 6
# # Groups:   group [2]
#    group           timestamp subgroup     A     B order
#    <dbl>              <dttm>    <int> <int> <int> <int>
#  1     1 2017-09-09 16:38:34       36     1     1     1
#  2     1 2017-09-09 16:38:35       36    49     1     1
#  3     1 2017-09-09 16:38:36       36     1     0     1
#  4     1 2017-09-09 16:38:36       35    74     1     2
#  5     1 2017-09-09 16:38:41       35    61     1     1
#  6     1 2017-09-09 16:38:41       35     5     0     1
#  7     1 2017-09-09 16:38:41       36    12     1     2
#  8     1 2017-09-09 16:38:41       36     5     1     2
#  9     2 2017-09-09 13:24:42       43     1     1     1
# 10     2 2017-09-09 13:24:46       43    30     1     1
# 11     2 2017-09-09 13:24:46       14    30     1     2
# 12     2 2017-09-09 13:39:03       14    18     1     1
# 13     2 2017-09-09 13:39:03       14    19     1     1
# 14     2 2017-09-09 13:39:54       14    32     1     1
# 15     2 2017-09-09 13:39:54       14    40     1     1
# 16     2 2017-09-09 13:39:54       43    32     1     2
# 17     2 2017-09-09 13:39:54       43    40     1     2

I assumed that the timestamps are sorted beforehand; if not, sort as a first step with ex %>% arrange(group, timestamp) %>% ....

You can add %>% select(-order) %>% ungroup() to get precisely your desired output (I left the order column in to make the result easier to understand).


Explanation

Let's keep only group 1 to illustrate what happens inside the mutate call:

ex1 <- filter(ex, group==1)

For each timestamp we make a list of subgroups:

split_ <- split(ex1$subgroup,ex1$timestamp)
# $`2017-09-09 16:38:34`
# [1] 36
# 
# $`2017-09-09 16:38:35`
# [1] 36
# 
# $`2017-09-09 16:38:36`
# [1] 36 35
# 
# $`2017-09-09 16:38:41`
# [1] 36 35 35 36

The order of the last item should be changed: 35 should come before 36, because 35 is the last subgroup in the 3rd element. As intersect keeps the order of items in its 1st argument, I can get the right order for the last item like this:

intersect(c(rev(split_[[3]]), split_[[4]]),
          split_[[4]])
# [1] 35 36
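
As a small check of that claim, the sketch below runs the same expression on hand-written vectors. The names prev, curr, reordered and unchanged are illustrative and not part of the answer. It also covers the case where the previous subgroup does not occur at the current timestamp, in which the stored order should be left alone.

```r
# Subgroups at the previous timestamp (already in final order) and at the
# current timestamp (in stored row order). Names are illustrative only.
prev <- c(36L, 35L)
curr <- c(36L, 35L, 35L, 36L)

# rev(prev) puts the most recently seen subgroup first; intersect() keeps the
# order of its first argument, drops duplicates, and keeps only values in curr.
reordered <- intersect(c(rev(prev), curr), curr)
print(reordered)
# [1] 35 36

# If the previous subgroup (99 here, a made-up id) is absent at the current
# timestamp, the current order is unchanged.
unchanged <- intersect(c(rev(99L), c(14L, 43L)), c(14L, 43L))
print(unchanged)
# [1] 14 43
```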

To apply this transformation to all elements I use purrr::accumulate, as I always need the last computed order to compute the next one:

acc_ <- accumulate(split_, ~intersect(c(rev(.x),.y),.y))
# [[1]]
# [1] 36
# 
# [[2]]
# [1] 36
# 
# [[3]]
# [1] 36 35
# 
# [[4]]
# [1] 35 36
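
For readers without purrr at hand, roughly the same accumulation can be sketched in base R with Reduce(). This is an assumed equivalent for illustration, not part of the original answer; the list copies the group 1 subgroups shown above without the timestamp names, and step and acc_base are hypothetical names.

```r
# Group 1 subgroups per timestamp, copied from the split() output above
split_ <- list(36L, 36L, c(36L, 35L), c(36L, 35L, 35L, 36L))

# One accumulation step: most recent previous subgroup first, restricted to
# the subgroups present at the current timestamp
step <- function(prev, curr) intersect(c(rev(prev), curr), curr)

# accumulate = TRUE keeps every intermediate result; simplify = FALSE makes
# sure a list is returned even when every element has length one
acc_base <- Reduce(step, split_, accumulate = TRUE, simplify = FALSE)
print(acc_base[[4]])
# [1] 35 36
```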

If I use split_ and acc_ with match, I can get the order that these elements should have in our output:

map2(split_ , acc_, match)
# $`2017-09-09 16:38:34`
# [1] 1
# 
# $`2017-09-09 16:38:35`
# [1] 1
# 
# $`2017-09-09 16:38:36`
# [1] 1 2
# 
# $`2017-09-09 16:38:41`
# [1] 2 1 1 2
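
The same matching step can be tried in isolation with base R's Map(), which here plays the role of map2(). This is only a sketch: both lists are typed in by hand from the outputs above, and ranks and order_col are illustrative names.

```r
# Stored subgroups per timestamp, and the preferred order per timestamp
split_ <- list(36L, 36L, c(36L, 35L), c(36L, 35L, 35L, 36L))
acc_   <- list(36L, 36L, c(36L, 35L), c(35L, 36L))

# match(x, table) returns the position of each x in table, so each row gets
# the rank of its subgroup within the preferred order for its timestamp
ranks <- Map(match, split_, acc_)

# Flatten back to one value per row (rows must already be sorted by timestamp)
order_col <- unlist(ranks, use.names = FALSE)
print(order_col)
# [1] 1 1 1 2 2 1 1 2
```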

Then I can unlist it to get my order column, and sort by it (after group and timestamp) to get the desired output.
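
Putting the steps together, here is a minimal base R sketch that avoids the assignment inside the map2() call. The helper name tie_rank and the data frame df are assumptions made for illustration; df holds rows 1 to 11 of the question's data, and, as in the answer, rows are assumed to be sorted by timestamp within each group.

```r
# Hypothetical helper: rank of each row's subgroup among ties, per timestamp.
# Assumes the rows passed in are already sorted by timestamp.
tie_rank <- function(subgroup, timestamp) {
  split_ <- split(subgroup, timestamp)
  acc_ <- Reduce(function(prev, curr) intersect(c(rev(prev), curr), curr),
                 split_, accumulate = TRUE, simplify = FALSE)
  unlist(Map(match, split_, acc_), use.names = FALSE)
}

# Rows 1-11 of the question's data (all of group 1, start of group 2)
df <- data.frame(
  group = c(rep(1, 8), rep(2, 3)),
  timestamp = as.POSIXct(
    c(1504975114, 1504975115, 1504975116, 1504975116, 1504975121, 1504975121,
      1504975121, 1504975121, 1504963482, 1504963486, 1504963486),
    origin = "1970-01-01", tz = "UTC"),
  subgroup = c(36L, 36L, 36L, 35L, 36L, 35L, 35L, 36L, 43L, 43L, 14L),
  A = c(1L, 49L, 1L, 74L, 12L, 61L, 5L, 5L, 1L, 30L, 30L)
)

# Apply the helper within each group by passing row indices through ave()
df$order <- ave(seq_len(nrow(df)), df$group,
                FUN = function(i) tie_rank(df$subgroup[i], df$timestamp[i]))

# order() is stable, so rows tied on all three keys keep their stored order
sorted <- df[order(df$group, df$timestamp, df$order), ]
print(sorted$subgroup)
# [1] 36 36 36 35 35 35 36 36 43 43 14
```

With dplyr loaded, the same helper could presumably be called as mutate(order = tie_rank(subgroup, timestamp)) after group_by(group), which keeps the pipeline from the answer while making the intermediate steps easier to test.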
