How to calculate RMSE with missing values?
Question
I have a dataset with 679 rows and 16 columns in which 30% of the values are missing. I decided to impute these missing values with the function impute.knn from the package impute, and I got a dataset with 679 rows and 16 columns but without the missing values.
Now I want to check the accuracy of the imputation using the RMSE, and I tried two options:
- load the package hydroGOF and apply the rmse function
- sqrt(mean((obs - sim)^2, na.rm = TRUE))
In both cases I get the error: Error in sim - obs : non-numeric argument to binary operator.
This is happening because the original data set contains NA values (some values are missing).
How can I calculate the RMSE if I remove the missing values? Then obs and sim will have different sizes.
Recommended answer
Simple...
sqrt( sum( (df$model - df$measure)^2 , na.rm = TRUE ) / nrow(df) )
This assumes your data frame is called df. You also have to decide on your N: nrow(df) includes the two rows with missing data. Do you want to exclude these from the N observations? I'd guess yes, so instead of nrow(df) you probably want to use sum( !is.na(df$measure) ), or, following @Joshua, just
sqrt( mean( (df$model-df$measure)^2 , na.rm = TRUE ) )
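To make the choice of N concrete, here is a minimal sketch on made-up data. The column names model and measure follow the answer above; the values, the intermediate variable names, and the helper rmse_na are assumptions introduced only for illustration. It also counts complete pairs from the squared errors themselves, which covers the case where either column has an NA.

```r
# Toy data: column names follow the answer, values are invented for illustration.
df <- data.frame(
  model   = c(1.0, 2.0, 3.0, 4.0, 5.0),
  measure = c(1.5, NA,  2.0, NA,  5.0)
)

# Squared errors; NA wherever either value of the pair is missing.
sq_err <- (df$model - df$measure)^2

# Option 1: N counts every row, including rows with missing data.
rmse_all_rows <- sqrt(sum(sq_err, na.rm = TRUE) / nrow(df))

# Option 2: N counts only the complete pairs.
n_complete    <- sum(!is.na(sq_err))
rmse_complete <- sqrt(sum(sq_err, na.rm = TRUE) / n_complete)

# mean(..., na.rm = TRUE) drops the NA terms and divides by the number
# of remaining terms, so it matches option 2.
rmse_mean <- sqrt(mean(sq_err, na.rm = TRUE))

# Hypothetical helper (not part of any package): checks that both inputs
# are numeric vectors of equal length before computing the RMSE.
rmse_na <- function(obs, sim) {
  stopifnot(is.numeric(obs), is.numeric(sim), length(obs) == length(sim))
  sqrt(mean((sim - obs)^2, na.rm = TRUE))
}

print(c(all_rows = rmse_all_rows, complete = rmse_complete, mean = rmse_mean))
```

Because the NA positions are dropped pairwise, obs and sim never end up with different sizes. The is.numeric check in the helper is there because the "non-numeric argument to binary operator" error is typically raised when one operand is not numeric (for example a character column), so it may be worth checking the column types as well as the missing values.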