Efficient pairwise DTW calculation using numpy or cython
Question
I am trying to calculate the pairwise distances between multiple time series contained in a numpy array. Please see the code below:
print(type(sales))
print(sales.shape)
<class 'numpy.ndarray'>
(687, 157)
So, sales contains 687 time series of length 157. I am using pdist to calculate the DTW distances between the time series:
import fastdtw
import scipy.spatial.distance as sd
def my_fastdtw(sales1, sales2):
    # fastdtw returns (distance, path); keep only the distance
    return fastdtw.fastdtw(sales1, sales2)[0]
distance_matrix = sd.pdist(sales, my_fastdtw)
--- Tried doing it without pdist() ---
distance_matrix = []
m = len(sales)
for i in range(0, m - 1):
    for j in range(i + 1, m):
        # [0] keeps only the distance; fastdtw returns (distance, path)
        distance_matrix.append(fastdtw.fastdtw(sales[i], sales[j])[0])
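The double loop above visits the pairs (i, j) with i < j in the same order that pdist uses for its condensed output, so the flat list can be compared entry by entry with the pdist result. A minimal sketch of that ordering follows; the helper name condensed_index is made up for illustration, and m = 5 is just a small demo size.

```python
from itertools import combinations


def condensed_index(i, j, m):
    """Position of pair (i, j), with i < j, in a condensed distance vector for m items."""
    return m * i - i * (i + 1) // 2 + (j - i - 1)


m = 5  # small illustrative size; the question uses m = 687

# Pairs in the order produced by the nested loops from the question.
loop_pairs = [(i, j) for i in range(0, m - 1) for j in range(i + 1, m)]

# Same order as itertools.combinations, and consecutive condensed positions.
assert loop_pairs == list(combinations(range(m), 2))
assert [condensed_index(i, j, m) for i, j in loop_pairs] == list(range(m * (m - 1) // 2))
```

If a square matrix is needed, scipy.spatial.distance.squareform converts a condensed vector into its symmetric form.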
--- Parallelizing the inner for loop ---
from joblib import Parallel, delayed
import multiprocessing
import fastdtw
num_cores = multiprocessing.cpu_count() - 1
N = 687
def my_fastdtw(sales1, sales2):
    return fastdtw.fastdtw(sales1, sales2)[0]
results = [[] for i in range(N)]
for i in range(0, N - 1):
    results[i] = Parallel(n_jobs=num_cores)(delayed(my_fastdtw)(sales[i], sales[j]) for j in range(i + 1, N))
All the methods are very slow. The parallel method takes around 12 minutes. Can someone please suggest an efficient way?
--- Following the steps mentioned in the answer below ---
Here is what the lib folder looks like:
VirtualBox:~/anaconda3/lib/python3.6/site-packages/fastdtw-0.3.2-py3.6-linux-x86_64.egg/fastdtw$ ls
_fastdtw.cpython-36m-x86_64-linux-gnu.so fastdtw.py __pycache__
_fastdtw.py __init__.py
So there is a Cython version of fastdtw in there. I did not receive any errors during installation. Even now, when I press CTRL-C during program execution, I can see that the pure-Python version (fastdtw.py) is being used:
/home/vishal/anaconda3/lib/python3.6/site-packages/fastdtw/fastdtw.py in fastdtw(x, y, radius, dist)
/home/vishal/anaconda3/lib/python3.6/site-packages/fastdtw/fastdtw.py in __fastdtw(x, y, radius, dist)
The code remains as slow as before.
Recommended answer
TL;DR
Your fastdtw failed to install the fast C++ version and silently falls back to a pure-Python version, which is slow.
You need to fix the installation of the fastdtw package.
The whole calculation is done in fastdtw, so you cannot really speed it up from the outside. And parallelization in Python is not such an easy thing (yet?).
The fastdtw documentation says it needs about O(n) operations for a comparison, so for your whole test set it will need on the order of 10^9 operations, which should finish within a few seconds if programmed in, for example, C. The performance you see is nowhere near that.
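The estimate can be checked with a few lines of arithmetic based on the shape (687, 157) from the question. This is only a rough sketch: the constant factor hidden in O(n) is not given in the source, so the final comment about it is an assumption.

```python
n_series = 687   # number of time series, from sales.shape
length = 157     # length of each series, from sales.shape

# Number of unordered pairs that pdist has to evaluate.
n_pairs = n_series * (n_series - 1) // 2
print(n_pairs)   # 235641

# One O(n) comparison per pair, ignoring the constant factor.
linear_work = n_pairs * length
print(linear_work)   # 36995637

# Assumption: a per-element constant in the tens brings this to the order of 10^9.
```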
If we look at the code of fastdtw, we see that there are two versions: the Cython/C++ version, which is fast, and a slow pure-Python fallback. If the fast version isn't present, the slow Python version is silently used.
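One way to check which version an environment would pick up, without interrupting a running program, is to ask the import system what the compiled submodule resolves to. The sketch below is not part of fastdtw; the helper name is_compiled_module is invented here, and the submodule name fastdtw._fastdtw is taken from the directory listing in the question.

```python
import importlib.machinery
import importlib.util


def is_compiled_module(module_name):
    """Return True if module_name resolves to a compiled extension file,
    False if it resolves to anything else, and None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None  # parent package missing or module in an odd state
    if spec is None:
        return None
    origin = spec.origin or ""
    # EXTENSION_SUFFIXES lists the platform's extension suffixes, e.g. ".so" or ".pyd".
    return origin.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES))


# True suggests the fast extension is importable; False or None suggests the fallback.
print("fastdtw._fastdtw compiled:", is_compiled_module("fastdtw._fastdtw"))
```

Printing fastdtw.__file__ after importing the package also shows which installed copy Python actually imports, which matters when more than one copy sits in site-packages.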
So run your calculation, interrupt it with Ctrl+C, and you will see that you are somewhere in Python code. You can also go to your lib folder and see that only the pure-Python version is inside.
So your installation of the fast fastdtw version failed. Actually, I think the wheel package is botched; at least for my version, only the pure-Python code is present.
What to do?
- Get the source code, for example via
git clone https://github.com/slaypni/fastdtw
- Change into the fastdtw folder and run
python setup.py build
- Watch out for errors. Mine was
fatal error: numpy/npy_math.h: No such file or directory
- Fix it.
For me, the fix was to change the following lines in setup.py:
import numpy  # THIS ADDED
extensions = [Extension(
    'fastdtw._fastdtw',
    [os.path.join('fastdtw', '_fastdtw' + ext)],
    language="c++",
    include_dirs=[numpy.get_include()],  # AND ADDED numpy.get_include()
    libraries=["stdc++"]
)]
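Before rebuilding, it can help to confirm that the header named in the error really lives under the directory that numpy.get_include() returns. This is a small diagnostic sketch, not part of the fastdtw build. The relative path numpy/npy_math.h comes from the error message above, and the ImportError guard is only there so the snippet also runs where NumPy is absent.

```python
import os

try:
    import numpy
except ImportError:
    numpy = None  # without NumPy the build cannot locate its headers at all

if numpy is None:
    include_dir = None
    print("NumPy is not importable in this environment")
else:
    # Directory that setup.py passes to the compiler via include_dirs.
    include_dir = numpy.get_include()
    header = os.path.join(include_dir, "numpy", "npy_math.h")
    print("include dir:", include_dir)
    print("npy_math.h found:", os.path.exists(header))
```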
- Repeat the previous two steps (watch for errors, fix them) until the build succeeds
- Run
python setup.py install
Now your program should be about 100 times faster.
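To see whether the reinstall helped without waiting for the full run, one option is to time a small random sample of pairs and extrapolate to all of them. The sketch below is only a rough estimator. The names estimate_total_seconds and placeholder_distance are hypothetical, and the placeholder distance is a stand-in so the snippet runs without fastdtw installed.

```python
import random
import time


def estimate_total_seconds(distance, series, n_samples=20, seed=0):
    """Time `distance` on randomly sampled pairs and extrapolate to all pairs."""
    rng = random.Random(seed)
    m = len(series)
    total_pairs = m * (m - 1) // 2
    start = time.perf_counter()
    for _ in range(n_samples):
        i, j = rng.sample(range(m), 2)  # two distinct series indices
        distance(series[i], series[j])
    per_pair = (time.perf_counter() - start) / n_samples
    return per_pair * total_pairs


def placeholder_distance(a, b):
    # Hypothetical stand-in, NOT DTW: sum of absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))


# Synthetic demo data: 50 series of length 157 (the question has 687 series).
demo = [[random.random() for _ in range(157)] for _ in range(50)]
print("estimated seconds for all pairs:", estimate_total_seconds(placeholder_distance, demo))
```

With the real data, passing my_fastdtw and sales instead of the placeholders gives a before and after comparison of the two installations.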