Linear Interpolation using dplyr
Problem description
I'm trying to use the na.approx() function from the zoo library (in conjunction with xts) to interpolate missing values in repeated-measures data for multiple individuals with multiple measurements.
Sample data...
event.date <- c("2010-05-25", "2010-09-10", "2011-05-13", "2012-03-28", "2013-03-07",
"2014-02-13", "2010-06-11", "2010-09-10", "2011-05-13", "2012-03-28",
"2013-03-07", "2014-02-13")
variable <- c("neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd", "neck.bmd",
"wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd", "wbody.bmd")
value <- c(0.7490, 0.7615, 0.7900, 0.7730, NA, 0.7420, 1.0520, 1.0665, 1.0760,
1.0870, NA, 1.0550)
## Bind into a data frame
df <- data.frame(event.date, variable, value)
rm(event.date, variable, value)
## Convert date
df$event.date <- as.Date(df$event.date)
## Load libraries
library(magrittr)
library(xts)
library(zoo)
I can interpolate one missing data point for a single outcome for a given person using xts() and na.approx()...
## Subset one variable
wbody <- subset(df, variable == "wbody.bmd")
## order/index and then interpolate
xts(wbody$value, wbody$event.date) %>%
na.approx()
2010-06-11 1.052000
2010-09-10 1.066500
2011-05-13 1.076000
2012-03-28 1.087000
2013-03-07 1.070977
2014-02-13 1.055000
Having a matrix returned is not ideal, but I can work around that. The main problem, though, is that I have multiple outcomes for multiple people. I thought, perhaps naively, that since this is a split-apply-combine problem I could use dplyr to achieve this in the following manner...
## Load library
library(dplyr)
## group and then arrange the data (to ensure dates are correct)
df %>%
group_by(variable) %>%
arrange(variable, event.date) %>%
xts(.$value, .$event.date) %>%
na.approx()
Error in xts(., .$value, .$event.date) : order.by requires an appropriate time-based object
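The error message hints at the cause: %>% passes the left-hand side as the first argument even when . also appears inside nested expressions such as .$value, so the call becomes xts(., .$value, .$event.date) and the value column ends up in the order.by position. xts() also knows nothing about dplyr groups. Below is a minimal sketch of one way around both issues, assuming the sample data above. It uses base split() and lapply() with zoo(), which is a choice made here for illustration and not the approach settled on later in this post.

```r
library(zoo)

# Rebuild the sample data so the block runs on its own
df <- data.frame(
  event.date = as.Date(c("2010-05-25", "2010-09-10", "2011-05-13",
                         "2012-03-28", "2013-03-07", "2014-02-13",
                         "2010-06-11", "2010-09-10", "2011-05-13",
                         "2012-03-28", "2013-03-07", "2014-02-13")),
  variable   = rep(c("neck.bmd", "wbody.bmd"), each = 6),
  value      = c(0.7490, 0.7615, 0.7900, 0.7730, NA, 0.7420,
                 1.0520, 1.0665, 1.0760, 1.0870, NA, 1.0550)
)

# Split by outcome, then interpolate each series against its own dates
interpolated <- lapply(split(df, df$variable), function(d) {
  d <- d[order(d$event.date), ]          # make sure dates are in order
  na.approx(zoo(d$value, d$event.date))  # one zoo series per outcome
})

interpolated$wbody.bmd
```

The result is a named list with one zoo series per outcome, so it would still need to be bound back into a data frame if that is the shape required.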
It seems that dplyr doesn't play well with xts/zoo. I've spent a couple of hours searching for tutorials and examples on how to interpolate missing data points in R, but all I've found are single-case examples. So far I've been unable to find anything on how to do this for multiple sites for multiple people. (I realise I could reduce it to a multiple-people problem by reshaping my data to wide format, but that still wouldn't solve the problem I'm encountering.)
Any thoughts/advice/insights on how to proceed would be greatly appreciated.
Thanks
EDIT: Clarified that some functions come from the zoo package.
The solution I've gone with is based on the first comment from @docendodiscimus.
Rather than attempting to create a new data frame, as I had been doing, this approach simply adds columns to the existing data frame by taking advantage of dplyr's mutate() function.
My code is now...
df %>%
group_by(variable) %>%
arrange(variable, event.date) %>%
mutate(ip.value = na.approx(value, maxgap = 4, rule = 2))
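The maxgap argument allows runs of up to four consecutive NAs to be filled (longer gaps are left as NA), whilst rule = 2, which is passed through to approx(), fills NAs at the flanking time points with the nearest observed value.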
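One thing to be aware of: when na.approx() is called on a plain vector it interpolates against the position index (1, 2, 3, ...) and not the dates, so the result can differ slightly from the xts version when visits are unevenly spaced. The sketch below, assuming the same sample data, passes the dates through the x argument of na.approx() to interpolate against time. The column names ip.index and ip.date are made up for this comparison.

```r
library(dplyr)
library(zoo)

# Rebuild the sample data so the block runs on its own
df <- data.frame(
  event.date = as.Date(c("2010-05-25", "2010-09-10", "2011-05-13",
                         "2012-03-28", "2013-03-07", "2014-02-13",
                         "2010-06-11", "2010-09-10", "2011-05-13",
                         "2012-03-28", "2013-03-07", "2014-02-13")),
  variable   = rep(c("neck.bmd", "wbody.bmd"), each = 6),
  value      = c(0.7490, 0.7615, 0.7900, 0.7730, NA, 0.7420,
                 1.0520, 1.0665, 1.0760, 1.0870, NA, 1.0550)
)

result <- df %>%
  group_by(variable) %>%
  arrange(variable, event.date) %>%
  mutate(
    # interpolates against row position within each group
    ip.index = na.approx(value, maxgap = 4, rule = 2),
    # interpolates against the actual dates within each group
    ip.date  = na.approx(value, x = event.date, maxgap = 4, rule = 2)
  ) %>%
  ungroup()

result
```

For the wbody.bmd gap on 2013-03-07, the position-based column gives the midpoint of the two neighbouring values (1.071), while the date-based column should reproduce the 1.070977 shown in the xts output earlier.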