How to remove inconsistencies from dataframe (time series)


Question

Let's say that we have this dataframe:

x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                        c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
                        c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
                        c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5)))
colnames(x)<- c("ID", "Visit", "Time", "State")

Column ID indicates the subject ID.

Column Visit indicates the sequence of visits.

Column Time indicates the time that has passed to reach a certain "State"

Column State indicates severity of a certain disease, where 5 means death. That means that you can fluctuate from worse states to better states, but you can never improve from category 5, since you are dead.

I would like to identify only those subjects that improved from category 5 to a better one, since these are errors from the dataframe (i.e. rows 13 and 16).

Additionally, I would like to remove those rows where a subject seems to have died more than once (i.e. row 18).

I asked a similar question before, but it was very general and implied that all fluctuations to a better state should be removed from the dataset, which is not what I actually want.

Recommended answer

Answer to modified question

The OP has modified the question substantially by requesting that all rows appearing after the first occurrence of State 5 (death) be considered erroneous. This includes false recoveries (as in rows 13 and 16) as well as "duplicated deaths" (as in rows 17 and 18).

Answering this requires a completely different approach. One possibility is to use a non-equi join:

library(data.table)
setDT(x)[x[, first(Visit[State == 5]), by = ID], on = .(ID, Visit > V1), error := TRUE][]




    ID Visit Time State error
 1:  A     1 10.0     1    NA
 2:  A     2 12.5     3    NA
 3:  A     3 15.0     4    NA
 4:  B     1  2.0     1    NA
 5:  B     2  3.4     2    NA
 6:  B     3  5.7     3    NA
 7:  B     2  8.0     4    NA
 8:  B     3  9.5     3    NA
 9:  C     1  1.0     2    NA
10:  C     2  5.6     2    NA
11:  C     3  8.9     3    NA
12:  C     4 10.0     5    NA
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3    NA
15:  D     2  3.4     5    NA
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5  TRUE
18:  D     5 10.5     5  TRUE


The visit number of the first occurrence of State 5 for each ID is determined by

x[, first(Visit[State == 5]), by = ID]




   ID V1
1:  C  4
2:  D  2


In the subsequent non-equi join only those rows are marked which appear after the first State 5 event.
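
Since the question also asks to remove the offending rows, the flagged rows can be dropped afterwards. The following is only a sketch under assumptions: the names first_death, first_visit and cleaned are made up for illustration, and it relies on Visit numbers increasing within each ID after the first death.

```r
library(data.table)

x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
setDT(x)

# Visit number of the first State 5 row per ID
# (IDs that never reach State 5 do not appear in the result)
first_death <- x[State == 5, .(first_visit = first(Visit)), by = ID]

# Non-equi update join: flag rows whose Visit is later than the first death
x[first_death, on = .(ID, Visit > first_visit), error := TRUE]

# Keep only the unflagged rows and drop the helper column (hypothetical name: cleaned)
cleaned <- x[is.na(error)][, error := NULL]
cleaned
```

Here x keeps the error column for inspection, while cleaned holds the rows up to and including the first death of each subject.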

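Data

The data frame as used in this answer, built with data.frame() so that Visit, Time and State stay numeric (cbind() on mixed character and numeric vectors coerces every column to character):

x <- data.frame(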
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
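
As a different technique from the non-equi join above, the two kinds of error named in the question (false recoveries and duplicated deaths) can also be told apart in base R with a per-ID running count. This is a minimal sketch that assumes the rows are already in chronological order within each ID; the names is_dead, deaths_before and issue are hypothetical.

```r
x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# 1 where the subject is recorded as dead, 0 otherwise
is_dead <- as.integer(x$State == 5)

# Number of State 5 rows that precede each row within the same ID (row order)
deaths_before <- ave(is_dead, x$ID, FUN = function(v) cumsum(v) - v)

# Label every row that follows a recorded death; all other rows stay NA
x$issue <- ifelse(deaths_before == 0, NA_character_,
                  ifelse(x$State == 5, "duplicated death", "false recovery"))

# Show only the problematic rows
x[!is.na(x$issue), ]
```

Because this version counts rows in their stored order instead of comparing Visit numbers, it does not depend on Visit being strictly increasing, but it does depend on the row order being correct.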
