How to order the ties in data so that the previously observed value appears first
ex <- structure(list(group = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2), timestamp = structure(c(1504975114, 1504975115,
1504975116, 1504975116, 1504975121, 1504975121, 1504975121, 1504975121,
1504963482, 1504963486, 1504963486, 1504964343, 1504964343, 1504964394,
1504964394, 1504964394, 1504964394), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), subgroup = c(36L, 36L, 36L, 35L, 36L, 35L,
35L, 36L, 43L, 43L, 14L, 14L, 14L, 14L, 14L, 43L, 43L), A = c(1L,
49L, 1L, 74L, 12L, 61L, 5L, 5L, 1L, 30L, 30L, 18L, 19L, 32L,
40L, 32L, 40L), B = c(1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("group", "timestamp",
"subgroup", "A", "B"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -17L))
I have data like the above. I want to sort the data within group by timestamp, but also control how ties in timestamp are handled. More precisely, if two observations have the same timestamp, I would like the observation whose subgroup id matches the value from the previous timestamp to come first. So the desired output would look like this:
# A tibble: 17 x 5
group timestamp subgroup A B
<dbl> <dttm> <int> <int> <int>
1 1.00 2017-09-09 16:38:34 36 1 1
2 1.00 2017-09-09 16:38:35 36 49 1
3 1.00 2017-09-09 16:38:36 36 1 0
4 1.00 2017-09-09 16:38:36 35 74 1
5 1.00 2017-09-09 16:38:41 35 61 1
6 1.00 2017-09-09 16:38:41 35 5 0
7 1.00 2017-09-09 16:38:41 36 12 1
8 1.00 2017-09-09 16:38:41 36 5 1
9 2.00 2017-09-09 13:24:42 43 1 1
10 2.00 2017-09-09 13:24:46 43 30 1
11 2.00 2017-09-09 13:24:46 14 30 1
12 2.00 2017-09-09 13:39:03 14 18 1
13 2.00 2017-09-09 13:39:03 14 19 1
14 2.00 2017-09-09 13:39:54 14 32 1
15 2.00 2017-09-09 13:39:54 14 40 1
16 2.00 2017-09-09 13:39:54 43 32 1
17 2.00 2017-09-09 13:39:54 43 40 1
How can I do this?
Here's an idea using tidyverse:
library(tidyverse)
ex %>%
group_by(group) %>%
mutate(order = map2(
split_ <- split(subgroup,timestamp),
accumulate(split_, ~intersect(c(rev(.x),.y),.y)),
match) %>% unlist) %>%
arrange(group,timestamp,order)
# # A tibble: 17 x 6
# # Groups: group [2]
# group timestamp subgroup A B order
# <dbl> <dttm> <int> <int> <int> <int>
# 1 1 2017-09-09 16:38:34 36 1 1 1
# 2 1 2017-09-09 16:38:35 36 49 1 1
# 3 1 2017-09-09 16:38:36 36 1 0 1
# 4 1 2017-09-09 16:38:36 35 74 1 2
# 5 1 2017-09-09 16:38:41 35 61 1 1
# 6 1 2017-09-09 16:38:41 35 5 0 1
# 7 1 2017-09-09 16:38:41 36 12 1 2
# 8 1 2017-09-09 16:38:41 36 5 1 2
# 9 2 2017-09-09 13:24:42 43 1 1 1
# 10 2 2017-09-09 13:24:46 43 30 1 1
# 11 2 2017-09-09 13:24:46 14 30 1 2
# 12 2 2017-09-09 13:39:03 14 18 1 1
# 13 2 2017-09-09 13:39:03 14 19 1 1
# 14 2 2017-09-09 13:39:54 14 32 1 1
# 15 2 2017-09-09 13:39:54 14 40 1 1
# 16 2 2017-09-09 13:39:54 43 32 1 2
# 17 2 2017-09-09 13:39:54 43 40 1 2
I assumed that the timestamps are already sorted within each group. If they are not, sort as a first step with ex %>% arrange(group, timestamp) %>% ...
You can add %>% select(-order) %>% ungroup to get precisely your desired output (I left it this way to make it easier to understand).
Explanation
Let's keep only group 1 to illustrate what happens inside the mutate call:
ex1 <- filter(ex, group==1)
For each timestamp we make a list of subgroups:
split_ <- split(ex1$subgroup,ex1$timestamp)
# $`2017-09-09 16:38:34`
# [1] 36
#
# $`2017-09-09 16:38:35`
# [1] 36
#
# $`2017-09-09 16:38:36`
# [1] 36 35
#
# $`2017-09-09 16:38:41`
# [1] 36 35 35 36
The order of the last item should be changed: 35 should come before 36, because 35 is the value used last in the 3rd element. As intersect keeps the order of items in its 1st argument, I can get the right order for the last item like this:
intersect(c(rev(split_[[3]]), split_[[4]]),
split_[[4]])
# [1] 35 36
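To make the role of intersect easier to see, here is a small sketch that spells out the intermediate vector. The names prev, curr and candidates are illustrative only and do not appear in the answer; the values are the subgroups at the 3rd and 4th timestamps of group 1.

```r
# Subgroups seen at the 3rd and 4th timestamps of group 1
prev <- c(36L, 35L)
curr <- c(36L, 35L, 35L, 36L)

# rev(prev) puts the most recently seen subgroup first,
# then the current values follow in their original order
candidates <- c(rev(prev), curr)
print(candidates)
# [1] 35 36 36 35 35 36

# intersect() keeps each value once, in order of first appearance
# in its first argument, and drops anything not present in curr
ord <- intersect(candidates, curr)
print(ord)
# [1] 35 36
```

Appending curr after rev(prev) matters when a new subgroup shows up that was absent from the previous timestamp: it still ends up in the result, after the previously seen ones.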
To apply this transformation to all elements I use purrr::accumulate, as I always need the last computed order to compute the next one:
acc_ <- accumulate(split_, ~intersect(c(rev(.x),.y),.y))
# [[1]]
# [1] 36
#
# [[2]]
# [1] 36
#
# [[3]]
# [1] 36 35
#
# [[4]]
# [1] 35 36
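If purrr is not available, the same accumulation can probably be written with base R's Reduce and accumulate = TRUE. This is a sketch of an alternative, not the answer's own method; step is a hypothetical helper name, and the list is typed in by hand from the split_ output shown earlier.

```r
# Subgroups per timestamp for group 1, in timestamp order
split_ <- list(36L, 36L, c(36L, 35L), c(36L, 35L, 35L, 36L))

# One accumulation step: previous order reversed, then current values,
# restricted to (and deduplicated against) the current values
step <- function(prev, curr) intersect(c(rev(prev), curr), curr)

# simplify = FALSE keeps the result a list even when every element has length 1
acc_ <- Reduce(step, split_, accumulate = TRUE, simplify = FALSE)
print(acc_[[3]])
# [1] 36 35
print(acc_[[4]])
# [1] 35 36
```

Each step receives the previously computed order rather than the raw previous element, which is why a running accumulation is needed instead of a plain pairwise map.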
If I use split_ and acc_ with match, I can get the order that these elements should have in our output:
map2(split_ , acc_, match)
# $`2017-09-09 16:38:34`
# [1] 1
#
# $`2017-09-09 16:38:35`
# [1] 1
#
# $`2017-09-09 16:38:36`
# [1] 1 2
#
# $`2017-09-09 16:38:41`
# [1] 2 1 1 2
Then I can unlist it to get my order column, and sort by timestamp and order to get the desired output.
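Putting the steps together, the whole idea can be sketched in base R for a single group. The helper tie_rank and the small data frame df are assumptions made for illustration (df mirrors group 1 with simplified numeric timestamps); as in the answer, the rows are assumed to be sorted by timestamp already.

```r
# Hypothetical helper (not from the original answer): tie-break rank for one
# group's rows, which must already be sorted by timestamp
tie_rank <- function(subgroup, timestamp) {
  by_time <- split(subgroup, timestamp)
  step <- function(prev, curr) intersect(c(rev(prev), curr), curr)
  acc <- Reduce(step, by_time, accumulate = TRUE, simplify = FALSE)
  # Position of each row's subgroup in the order computed for its timestamp
  unlist(Map(match, by_time, acc), use.names = FALSE)
}

# Group 1 of the example, with simplified timestamps
df <- data.frame(
  timestamp = c(1, 2, 3, 3, 8, 8, 8, 8),
  subgroup  = c(36L, 36L, 36L, 35L, 36L, 35L, 35L, 36L),
  A         = c(1L, 49L, 1L, 74L, 12L, 61L, 5L, 5L)
)

df$rank <- tie_rank(df$subgroup, df$timestamp)
sorted <- df[order(df$timestamp, df$rank), ]
print(sorted$subgroup)
# [1] 36 36 36 35 35 35 36 36
print(sorted$A)
# [1]  1 49  1 74 61  5 12  5
```

For several groups, the helper would need to be applied per group (for example inside group_by and mutate, as in the answer), since the accumulated order should not carry over from one group to the next.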