How to remove inconsistencies from a dataframe (time series)
Question
Let's say that we have this dataframe:
x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5)))
colnames(x)<- c("ID", "Visit", "Time", "State")
Column ID indicates the subject ID.
Column Visit indicates the sequence of visits.
Column Time indicates the time that has passed to reach a certain "State".
Column State indicates the severity of a certain disease, where 5 means death. A subject can fluctuate from worse states to better states, but can never improve from category 5, since the subject is dead.
I would like to identify only those subjects that improved from category 5 to a better one, since these are errors in the dataframe (i.e. rows 13 and 16).
Additionally, I would like to remove those rows where a subject seems to have died more than once (i.e. row 18).
I asked a similar question before, but it was very general and implied that all fluctuations to a better state should be removed from the dataset, which is not what I actually want.
Recommended answer
Answer to modified question
The OP has modified the question substantially by requesting that every row appearing after the first occurrence of State 5 (death) be considered erroneous. This includes false recoveries (as in rows 13 and 16) as well as "duplicated deaths" (as in rows 17 and 18).
Answering this requires a completely different approach. One possibility is to use a non-equi join:
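library(data.table)
# Convert x to a data.table by reference, then flag every row whose Visit
# is greater than the first Visit with State 5 for the same ID
setDT(x)[x[, first(Visit[State == 5]), by = ID], on = .(ID, Visit > V1), error := TRUE][]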
ID Visit Time State error
1: A 1 10.0 1 NA
2: A 2 12.5 3 NA
3: A 3 15.0 4 NA
4: B 1 2.0 1 NA
5: B 2 3.4 2 NA
6: B 3 5.7 3 NA
7: B 2 8.0 4 NA
8: B 3 9.5 3 NA
9: C 1 1.0 2 NA
10: C 2 5.6 2 NA
11: C 3 8.9 3 NA
12: C 4 10.0 5 NA
13: C 5 11.0 2 TRUE
14: D 1 2.0 3 NA
15: D 2 3.4 5 NA
16: D 3 6.0 4 TRUE
17: D 4 8.0 5 TRUE
18: D 5 10.5 5 TRUE
First, the Visit number of the first occurrence of State 5 is found for each ID:
x[, first(Visit[State == 5]), by = ID]
ID V1
1: C 4
2: D 2
The subsequent non-equi join then marks only those rows that appear after the first State 5 event of the same ID.
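Since the question also asks to list the affected subjects and to remove the offending rows, here is a minimal sketch of how the error flag could be used afterwards. The names first_death, bad_ids and clean are illustrative and not part of the original answer. The join compares Visit numbers, so the sketch assumes that Visit increases over time within each subject that has a State 5 row.

```r
library(data.table)

# Same sample data as in the question, built with numeric columns
x <- data.table(
  ID    = rep(c("A", "B", "C", "D"), times = c(3, 5, 5, 5)),
  Visit = c(1,2,3, 1,2,3,2,3, 1,2,3,4,5, 1,2,3,4,5),
  Time  = c(10,12.5,15, 2,3.4,5.7,8,9.5, 1,5.6,8.9,10,11, 2,3.4,6,8,10.5),
  State = c(1,3,4, 1,2,3,4,3, 2,2,3,5,2, 3,5,4,5,5)
)

# First Visit with State 5 per subject (subjects without State 5 drop out)
first_death <- x[State == 5, .(V1 = first(Visit)), by = ID]

# Non-equi update join: flag rows whose Visit lies after that first death
x[first_death, on = .(ID, Visit > V1), error := TRUE]

# Subjects with at least one flagged row
bad_ids <- x[error == TRUE, unique(ID)]

# Keep only the unflagged rows and drop the helper column
clean <- x[is.na(error)][, error := NULL]
```

Rows that were never matched by the join keep error as NA, which is why the filter uses is.na(error) rather than !error.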
Data
x <- data.frame(
ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
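The question originally distinguished false recoveries from repeated deaths. As an alternative to the join, a cumulative count of State 5 rows can label the two kinds of error separately. This is only a sketch, not the answer's method: the column names deaths_before and error_type are made up for illustration, and it assumes that the rows of each subject are already in chronological order.

```r
library(data.table)

x <- data.table(
  ID    = rep(c("A", "B", "C", "D"), times = c(3, 5, 5, 5)),
  Visit = c(1,2,3, 1,2,3,2,3, 1,2,3,4,5, 1,2,3,4,5),
  Time  = c(10,12.5,15, 2,3.4,5.7,8,9.5, 1,5.6,8.9,10,11, 2,3.4,6,8,10.5),
  State = c(1,3,4, 1,2,3,4,3, 2,2,3,5,2, 3,5,4,5,5)
)

# Number of State 5 rows seen strictly before the current row, per subject
x[, deaths_before := cumsum(State == 5) - (State == 5), by = ID]

# Rows before or at the first death are fine (NA); later rows are labelled
x[, error_type := fifelse(
  deaths_before == 0, NA_character_,
  fifelse(State == 5, "duplicated death", "false recovery")
)]

# Show only the rows that were labelled as errors
x[!is.na(error_type)]
```

Because this works on row order rather than on Visit numbers, it does not depend on Visit being strictly increasing, but it does depend on the rows being sorted by time within each ID.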