R - Calculate difference (similarity measure) between similar datasets
Problem description
I have seen many questions that touch on this topic but have not yet found an answer. If I have missed a question that does answer this one, please flag this and point us to it.
Scenario: We have a benchmark dataset and two imputation methods. We systematically delete values from the benchmark and fill them back in with each method. Thus we have a benchmark, imputedData1 and imputedData2.
Question: Is there a function that produces a single number representing the difference between the benchmark and imputedData1, and likewise between the benchmark and imputedData2? For example, function(benchmark, imputedData1) = 3.3 and function(benchmark, imputedData2) = 2.8.
Note: The datasets are numerical and the same size. The method should work at the data level if possible (i.e. not fitting a regression to each dataset and comparing the regressions, unless that can work with ANY numerical dataset).
Reproducible datasets; the imputed versions differ from the benchmark only in the first row:
Benchmark:
> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
imputedData1:
> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         22.0   4 108.0 100 3.90 2.200 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
imputedData2:
> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         18.0   6 112.0 105 3.90 2.620 16.46  0  0    3    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
I have tried RMSE (root mean squared error), but it didn't work very well, so I am looking for other ways to tackle this problem.
Recommended answer
You could also check out the package ftsa. It can calculate about 20 error measures. In your case, a scaled error would make sense, as the units differ from column to column.
library(ftsa)
error(forecast=unlist(imputedData1),true=unlist(bench),
insampletrue = unlist(bench), method = "mase")
[1] 0.035136
error(forecast=unlist(imputedData2),true=unlist(bench),
insampletrue = unlist(bench), method = "mase")
[1] 0.031151
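For readers who want to see the idea of a scaled error without an extra package, here is a minimal base R sketch. It is not what ftsa computes: `scaled_rmse` is a hypothetical helper that takes the RMSE of each column, divides it by the standard deviation of that benchmark column so that columns with large units (such as disp or hp) do not dominate, and averages the result over columns. The datasets are rebuilt from the built-in mtcars data, on the assumption that they match the ones in the question.

```r
# Hypothetical helper (not part of ftsa): per-column RMSE divided by the
# standard deviation of the benchmark column, averaged over all columns.
# Assumes every benchmark column has non-zero variance; a constant column
# would cause a division by zero.
scaled_rmse <- function(benchmark, imputed) {
  stopifnot(identical(dim(benchmark), dim(imputed)))
  b <- as.matrix(benchmark)
  m <- as.matrix(imputed)
  rmse_per_col <- sqrt(colMeans((m - b)^2))  # error in each column's own units
  sd_per_col <- apply(b, 2, sd)              # scale of each benchmark column
  mean(rmse_per_col / sd_per_col)            # unit-free average
}

# Rebuild the example datasets from the built-in mtcars data
bench <- head(mtcars, n = 10)

imputedData1 <- bench
imputedData1[1, c("mpg", "cyl", "disp", "hp", "wt")] <- c(22.0, 4, 108.0, 100, 2.200)

imputedData2 <- bench
imputedData2[1, c("mpg", "disp", "hp", "am", "gear")] <- c(18.0, 112.0, 105, 0, 3)

scaled_rmse(bench, imputedData1)
scaled_rmse(bench, imputedData2)
scaled_rmse(bench, bench)  # identical datasets give 0
```

A smaller value means the imputed dataset is closer to the benchmark. Because every cell is included, cells that were never deleted contribute zero error and pull the number down; restricting the calculation to the cells that were actually imputed is a possible refinement.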
Data
bench <- read.table(text='mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4',header=TRUE,stringsAsFactors=FALSE)
imputedData1 <- read.table(text='mpg cyl disp hp drat wt qsec vs am gear carb
22.0 4 108.0 100 3.90 2.200 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4',header=TRUE,stringsAsFactors=FALSE)
imputedData2 <- read.table(text='mpg cyl disp hp drat wt qsec vs am gear carb
18.0 6 112.0 105 3.90 2.620 16.46 0 0 3 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4',header=TRUE,stringsAsFactors=FALSE)