Calculating a distance matrix by dtw


Problem description

I have two matrices of normalized read counts, one for control and one for treatment, from a time series spanning day 1 to day 26. I want to calculate a distance matrix with Dynamic Time Warping and then use it for clustering, but it seems too complicated. This is what I tried; can anyone help clarify? Thanks a lot.

> head(control[,1:4])
               MAST2     WWC2  PHYHIPL   R3HDM2
Control_D1  6.591024 5.695156 3.388652 5.756384
Control_D1 8.043454 5.365221 6.859768 6.936970
Control_D3 7.731590 4.868267 6.919972 6.931073
Control_D4 8.129948 5.105528 6.627016 7.090268
Control_D5 7.690863 4.729501 6.824746 6.904610
Control_D6 8.101723 5.334501 6.868990 7.115883
> 

> head(lead[,1:4])
              MAST2     WWC2  PHYHIPL   R3HDM2
Lead30_D1  6.418423 5.610699 3.734425 5.778046
Lead30_D2 7.918360 4.295191 6.559294 6.780952
Lead30_D3 7.807142 4.294722 6.599187 6.716040
Lead30_D4 7.856720 4.432136 6.572337 6.848483
Lead30_D5 7.827311 4.204738 6.607107 6.784094
Lead30_D6 7.848760 4.458451 6.581216 6.943003
>
> dim(control)
[1]   26 2603
> dim(lead)
[1]   26 2603
library(dtw)

for (i in control) { 
  for (j in lead) { 
    result[i,j] <- dtw( dist(control[,,i],lead[,,j]), distance.only=T )$normalizedDistance 
  }
}

Error in lead[, , j] : incorrect number of dimensions

Recommended answer

There have already been questions similar to yours, but the answers haven't been very detailed. Here's a breakdown of what you need to know, in the specific case of R.

The proxy package is made specifically for calculating cross-distance matrices. Check its vignette to see which measures it already implements. An example of its use:

set.seed(1L)
sample_data <- matrix(rnorm(50L), nrow = 5L, ncol = 10L)

suppressPackageStartupMessages(library(proxy))
distance_matrix <- proxy::dist(sample_data, method = "euclidean", 
                               upper = TRUE, diag = TRUE)
print(distance_matrix)
#>          1        2        3        4        5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000

Note: in the context of time series, proxy treats each row in a matrix as a series. You can confirm this from the fact that sample_data above is a 5x10 matrix and the resulting cross-distance matrix is 5x5.
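If your series are stored in columns instead (as in the question, where days are rows), the row convention above means the matrix has to be turned around first. The following is a minimal sketch, assuming a made-up matrix called series_in_columns; it compares transposing with t() against proxy's by_rows argument, which as far as I know gives the same result.

```r
library(proxy)

set.seed(1L)
# Hypothetical layout: 10 time points in rows, 5 series in columns
series_in_columns <- matrix(rnorm(50L), nrow = 10L, ncol = 5L)

# Option 1: transpose so that each row is one series
d_transposed <- proxy::dist(t(series_in_columns), method = "euclidean")

# Option 2: ask proxy to compare columns instead of rows
d_by_columns <- proxy::dist(series_in_columns, method = "euclidean",
                            by_rows = FALSE)

# Both describe 5 series, so the full matrix is 5x5
dim(as.matrix(d_transposed))

# The pairwise distances should agree
all.equal(as.vector(d_transposed), as.vector(d_by_columns))
```

Either way, the size of the result tells you how many series proxy thinks you gave it, which is a quick sanity check before running a slow measure such as DTW.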

The dtw package implements many variations of DTW, and it also leverages proxy. You could calculate a DTW distance matrix with:

suppressPackageStartupMessages(library(dtw))
dtw_distmat <- proxy::dist(sample_data, method = "dtw", 
                           upper = TRUE, diag = TRUE)
print(dtw_distmat)

Using custom distances

One nice thing about proxy is that it gives you the option to register custom functions. You seem to be interested in the normalized version of DTW, so you could do something like this:

ndtw <- function(x, y = NULL, ...) {
    dtw::dtw(x, y, ..., distance.only = TRUE)$normalizedDistance
}

pr_DB$set_entry(
  FUN = ndtw,
  names = "ndtw",
  loop = TRUE,
  distance = TRUE
)

ndtw_distmat <- proxy::dist(sample_data, method = "ndtw",
                            upper = TRUE, diag = TRUE)
print(ndtw_distmat)
#>           1         2         3         4         5
#> 1 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> 2 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> 3 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> 4 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> 5 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000

See the documentation of pr_DB for more information.
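One practical detail when registering custom functions: to my knowledge, set_entry() stops with an error if the name is already in the registry, which happens when you re-run a script in the same session. The sketch below guards the registration with pr_DB$entry_exists() and then checks one entry of the matrix against a direct call to the function. The ndtw definition is the same idea as above, written with a required y argument.

```r
library(proxy)
library(dtw)

# Normalized DTW distance between two univariate series
ndtw <- function(x, y, ...) {
  dtw::dtw(x, y, ..., distance.only = TRUE)$normalizedDistance
}

# Register only once per session; re-registering an existing name fails
if (!pr_DB$entry_exists("ndtw")) {
  pr_DB$set_entry(FUN = ndtw, names = "ndtw", loop = TRUE, distance = TRUE)
}

set.seed(1L)
sample_data <- matrix(rnorm(50L), nrow = 5L, ncol = 10L)
ndtw_distmat <- proxy::dist(sample_data, method = "ndtw")

# With loop = TRUE, proxy calls the function once per pair of rows,
# so entry [1, 2] should match a direct call on rows 1 and 2
direct <- ndtw(sample_data[1L, ], sample_data[2L, ])
all.equal(unname(as.matrix(ndtw_distmat)[1L, 2L]), direct)
```

If you need to change the function later, pr_DB$delete_entry("ndtw") removes the entry so it can be registered again.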

The dtwclust package (which I made) implements a basic but faster version of DTW that can use multi-threading and also leverages proxy:

suppressPackageStartupMessages(library(dtwclust))
dtw_basic_distmat <- proxy::dist(sample_data, method = "dtw_basic", normalize = TRUE)
print(dtw_basic_distmat)
#>      [,1]      [,2]      [,3]      [,4]      [,5]     
#> [1,] 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> [2,] 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> [3,] 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> [4,] 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> [5,] 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000

The dtw_basic implementation only supports two step patterns and one window type, but it is considerably faster:

suppressPackageStartupMessages(library(microbenchmark))
microbenchmark(
  proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L),
  proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)
)

Unit: microseconds
                  expr      min       lq     mean   median       uq       max neval cld
      method = "dtw"   5279.124 5621.742 6070.069 5802.354 6348.199 10411.000   100   b
 method = "dtw_basic"   657.966  710.418  776.474  752.282  814.037  1161.626   100  a
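The benchmark passes a Sakoe-Chiba window to both implementations. As a rough illustration of what that constraint does, here is a small sketch using only the dtw package on two made-up random-walk series (x and y are not from the question). The window limits how far the warping path may stray from the diagonal, so the constrained distance can never be smaller than the unconstrained one.

```r
library(dtw)

set.seed(1L)
# Two hypothetical univariate series of equal length
x <- cumsum(rnorm(30L))
y <- cumsum(rnorm(30L))

# Unconstrained DTW: any warping path is allowed
d_full <- dtw(x, y, distance.only = TRUE)$distance

# Sakoe-Chiba band: matched indices may differ by at most window.size
d_window <- dtw(x, y, window.type = "sakoechiba", window.size = 5L,
                distance.only = TRUE)$distance

# Restricting the set of paths cannot lower the minimum cost
d_window >= d_full
```

A tighter window also means fewer cells of the cost matrix have to be filled in, which is one reason windowed DTW is commonly used on long series.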

Another multi-threaded implementation is included in the parallelDist package, although I haven't personally tested it.

A single multivariate series is commonly a matrix where time spans the rows and the multiple variables span the columns. DTW also works for them:

mv_series1 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
mv_series2 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
print(dtw_distance <- dtw_basic(mv_series1, mv_series2))
#> [1] 22.80421

The nice thing about proxy is that it can calculate distances between objects contained in lists too, so you can put several multivariate series in lists of matrices:

mv_series <- lapply(1L:5L, function(dummy) {
  matrix(rnorm(15L), nrow = 5L, ncol = 3L)
})

mv_distmat_dtwclust <- proxy::dist(mv_series, method = "dtw_basic")
print(mv_distmat_dtwclust)
#>      [,1]     [,2]     [,3]     [,4]     [,5]    
#> [1,]  0.00000 27.43599 32.14207 36.42211 31.19279
#> [2,] 27.43599  0.00000 20.88470 23.88436 29.73219
#> [3,] 32.14207 20.88470  0.00000 22.14376 29.99899
#> [4,] 36.42211 23.88436 22.14376  0.00000 28.81111
#> [5,] 31.19279 29.73219 29.99899 28.81111  0.00000

Your case

Regardless of what you choose, you can probably use proxy to get your result, but since you haven't provided your whole data, I can't give you a more specific example. I presume that dtwclust::dtw_basic(control[, 1:4], lead[, 1:4], normalize = TRUE) would give you the distance between one pair of series, assuming you're treating each matrix as a multivariate series with 4 variables.
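A different reading of the question is that each gene (column) is its own univariate series of 26 time points, and the goal is to cluster genes. The sketch below follows that assumption with simulated stand-ins for control and lead, since the real data isn't available; the choice of hclust with average linkage and k = 2 is only for illustration. It also shows why the original loop failed: both objects are two-dimensional, so they must be indexed as m[, j], and m[, , j] raises "incorrect number of dimensions".

```r
library(proxy)
library(dtw)

set.seed(123L)
n_days <- 26L
genes <- c("MAST2", "WWC2", "PHYHIPL", "R3HDM2")

# Simulated stand-ins for the real matrices (rows = days, columns = genes)
control <- matrix(rnorm(n_days * length(genes), mean = 6), nrow = n_days,
                  dimnames = list(paste0("Control_D", seq_len(n_days)), genes))
lead <- matrix(rnorm(n_days * length(genes), mean = 6), nrow = n_days,
               dimnames = list(paste0("Lead30_D", seq_len(n_days)), genes))

# proxy treats rows as series, so transpose to get one row per gene.
# Entry [i, j] is the DTW distance between control gene i and lead gene j.
cross_distmat <- proxy::dist(t(control), t(lead), method = "dtw")

# Gene-by-gene distances within one condition, usable for clustering
control_distmat <- proxy::dist(t(control), method = "dtw")
hc <- hclust(control_distmat, method = "average")
clusters <- cutree(hc, k = 2L)
print(clusters)
```

With the full 26 x 2603 matrices the same calls apply, but the result has 2603 x 2603 entries, so a faster method such as "dtw_basic" from dtwclust is probably worth trying there.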
