R-计算相似数据集之间的差异(相似性度量) [英] R - Calculate difference (similarity measure) between similar datasets

查看:168
本文介绍了R-计算相似数据集之间的差异(相似性度量)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经看到许多与此主题相关的问题,但尚未找到答案.如果我错过了一个确实回答了这个问题的问题,请务必对此做一个标记,并指出我们的问题.

I have seen many questions that touch on this topic but haven't yet found an answer. If I have missed a question that does answer this question, please do mark this and point us to the question.

场景:我们有一个基准数据集,我们有插补方法,我们有系统地从基准中删除值,并使用两种不同的插补方法.因此,我们有一个基准,imputedData1和imputedData2.

Scenario: We have a benchmark dataset, we have imputation methods, we systematically delete values from the benchmark and use two different imputation methods. Thus we have a benchmark, imputedData1 and imputedData2.

问题:是否有一个函数可以产生一个数字,该数字表示基准和imputedData1之间的差异或/和基准与imputedData2之间的差异.即function(benchmark,imputedData1)= 3.3和function(benchmark,imputedData2)= 2.8

Question: Is there a function that can produce a number that represents the difference between the benchmark and imputedData1 or/and the difference between the benchmark and imputedData2. Ie function(benchmark, imputedData1) = 3.3 and function(benchmark, imputedData2) = 2.8

注意:数据集是数字的,数据集的大小相同,如果可能的话,方法应该在数据级别上起作用(即,不创建回归并比较回归-除非它可以与任何数字数据集一起使用).

Note: Datasets are numerical, datasets are the same size, method should work at the data level if possible (ie not creating a regression and comparing regressions - unless it can work with ANY numerical dataset).

可重现的数据集,仅在第一行中进行了更改:

Reproducible datasets, they have only been changed in the first row:

基准:

> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

imputedData1:

imputedData1:

> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         22.0   4 108.0 100 3.90 2.200 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

imputedData2:

imputedData2:

> head(mtcars,n=10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         18.0   6 112.0 105 3.90 2.620 16.46  0  0    3    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

我曾尝试使用RMSE(均方根误差),但效果不是很好,因此我试图找到其他方法来解决此问题.

I have tried to use RMSE (root mean squared error) but it didn't work very well so I am trying to find other ways to tackle this problem.

推荐答案

您还可以签出软件包ftsa.它可以计算出大约 20个错误度量.在您的情况下,按比例缩放的误差会很有意义,因为各列的单位不同.

You could also check out package ftsa. It has about 20 error measures that can be calculated. In your case, a scaled error would make sense as the units differ from column to column.

library(ftsa)
error(forecast=unlist(imputedData1),true=unlist(bench), 
          insampletrue = unlist(bench), method = "mase")
[1] 0.035136

error(forecast=unlist(imputedData2),true=unlist(bench), 
          insampletrue = unlist(bench), method = "mase")
[1] 0.031151

数据

bench <- read.table(text='mpg cyl  disp  hp drat    wt  qsec vs am gear carb
21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4',header=TRUE,stringsAsFactors=FALSE)

imputedData1 <- read.table(text='mpg cyl  disp  hp drat    wt  qsec vs am gear carb
22.0   4 108.0 100 3.90 2.200 16.46  0  1    4    4
21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4',header=TRUE,stringsAsFactors=FALSE)

imputedData2 <- read.table(text='mpg cyl  disp  hp drat    wt  qsec vs am gear carb
18.0   6 112.0 105 3.90 2.620 16.46  0  0    3    4
21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4',header=TRUE,stringsAsFactors=FALSE)

这篇关于R-计算相似数据集之间的差异(相似性度量)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆