How to remove inconsistencies from dataframe (time series)

Published: 2020/6/2 20:39:43 · Tags: r, dataframe, time-series, aggregate


Problem description

Let's say that we have this dataframe:

x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                        c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
                        c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
                        c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5)))
colnames(x)<- c("ID", "Visit", "Time", "State")

Column ID indicates the subject ID.

Column Visit indicates the sequence of visits.

Column Time indicates the time that has passed to reach a certain State.

Column State indicates the severity of a certain disease, where 5 means death. That means you can fluctuate from worse states to better states, but you can never improve from category 5, since you are dead.

I would like to identify only those subjects that improved from category 5 to a better one, since these are errors in the dataframe (i.e. rows 13 and 16).

Additionally, I would like to remove those rows where a subject seems to have died more than once (i.e. row 18).

I asked a similar question before, but it was very general and it implied that all fluctuations to a better state were removed from the dataset, which is not what I actually want.

Answer to modified question

The OP has modified the question substantially by requesting that all rows which appear after the first occurrence of State 5 (death) be considered erroneous. This includes false recoveries (as in rows 13 and 16) as well as "duplicated deaths" (as in rows 17 and 18).

Answering this requires a completely different approach. One possibility is to use a non-equi join:

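library(data.table)
# Look up the first Visit with State == 5 per ID, then flag (by reference) every
# row of the same ID with a later Visit via a non-equi update join.
setDT(x)[x[, first(Visit[State == 5]), by = ID], on = .(ID, Visit > V1), error := TRUE][]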

    ID Visit Time State error
 1:  A     1 10.0     1    NA
 2:  A     2 12.5     3    NA
 3:  A     3 15.0     4    NA
 4:  B     1  2.0     1    NA
 5:  B     2  3.4     2    NA
 6:  B     3  5.7     3    NA
 7:  B     2  8.0     4    NA
 8:  B     3  9.5     3    NA
 9:  C     1  1.0     2    NA
10:  C     2  5.6     2    NA
11:  C     3  8.9     3    NA
12:  C     4 10.0     5    NA
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3    NA
15:  D     2  3.4     5    NA
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5  TRUE
18:  D     5 10.5     5  TRUE

The Visit of the first occurrence of State 5 is determined for each ID by

x[, first(Visit[State == 5]), by = ID]

   ID V1
1:  C  4
2:  D  2

In the subsequent non-equi join, only those rows are marked that appear after the first State 5 event.
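The question also asks to remove the erroneous rows, not only to mark them. A minimal sketch of that follow-up step, assuming the data are built with numeric columns; the names `first_death` and `clean` are illustrative and not part of the original answer:

```r
library(data.table)

# Sample data from the question, with numeric Visit, Time and State
x <- data.table(
  ID    = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time  = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# Visit of the first State 5 row per ID (IDs without a death yield no row)
first_death <- x[, first(Visit[State == 5]), by = ID]

# Non-equi update join: flag rows of the same ID with a later Visit
x[first_death, on = .(ID, Visit > V1), error := TRUE]

# Keep only the unflagged rows and leave out the helper column
clean <- x[is.na(error), .(ID, Visit, Time, State)]
```

Here `clean` keeps every row up to and including the first death of each subject, while `x` itself still carries the `error` column for inspection.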

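Data

The data are built with data.frame() so that Visit, Time and State stay numeric; as.data.frame(cbind(...)) on mixed character and numeric vectors would turn every column into character (or factor).

x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))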
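For comparison, the same rows can probably be flagged without a join. The sketch below is a different technique from the one in the answer: it counts the State 5 rows seen so far within each ID and lags that count by one row. It assumes the rows are already in chronological order within each ID, since it relies on row order rather than on the Visit number.

```r
library(data.table)

# Sample data from the question, with numeric Visit, Time and State
x <- data.table(
  ID    = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time  = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# cumsum() counts deaths up to and including the current row; shift() lags the
# count by one row so that the first death itself is not flagged.
x[, error := shift(cumsum(State == 5), fill = 0L) > 0L, by = ID]

# Row numbers of the flagged rows
flagged <- which(x$error)
```

Unlike the non-equi join, this variant yields FALSE rather than NA for the rows that are not flagged.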
