How can I perform K-means clustering on time series data?


Problem description

How can I do K-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster a 1×M time series, where M is the data length. In particular, I'm not sure how to update the mean of a cluster for time series data.
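
(Editorial note, not part of the original question: under plain Euclidean distance, the "mean" of a cluster of time series is simply the element-wise mean taken across the member series, one value per time step. A tiny MATLAB illustration with made-up data:)

% Element-wise mean of two series assigned to the same cluster (illustration only)
A = [1 2 3;      % first member series
     3 4 5];     % second member series
centroid = mean(A, 1)   % -> [2 3 4], one mean value per time step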

I have a set of labelled time series, and I want to use the K-means algorithm to check whether I will get back similar labels. My X matrix will be N × M, where N is the number of time series and M is the data length, as mentioned above.

Does anyone know how to do this? For example, how could I modify this k-means MATLAB code so that it works for time series data? Also, I would like to be able to use distance metrics other than Euclidean distance.

To better illustrate my doubts, here is the code I modified for time series data:


% Note: X (the N-by-M data matrix), n = size(X,1) and k are assumed to be
% defined earlier in the enclosing function.
% Check if second input is centroids
if ~isscalar(k) 
    c=k;
    k=size(c,1);
else
    c=X(ceil(rand(k,1)*n),:); % assign centroid randomly at start
end

% allocating variables
g0=ones(n,1); 
gIdx=zeros(n,1);
D=zeros(n,k);

% Main loop converge if previous partition is the same as current
while any(g0~=gIdx)
%     disp(sum(g0~=gIdx))
    g0=gIdx;
    % Loop for each centroid
    for t=1:k
        %  d=zeros(n,1);
        % Loop for each dimension
        for s=1:n
            D(s,t) = sqrt(sum((X(s,:)-c(t,:)).^2)); 
        end
    end
    % Partition data to closest centroids
    [z,gIdx]=min(D,[],2);
    % Update centroids using means of partitions
    for t=1:k

        % Is this how we calculate the new mean of the time series?
        % (the dimension argument 1 keeps mean() element-wise even when a
        %  cluster contains only a single series)
        c(t,:)=mean(X(gIdx==t,:),1);

    end
end
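
One way to address the last part of the question (distance metrics other than Euclidean) is to factor the distance computation out into a function handle. The sketch below is not part of the original question; the function name kmeans_ts and the argument distfun are made up for illustration, and it assumes X is the N-by-M matrix of time series described above:

% Hypothetical variant of the code above: the pairwise distance is supplied
% as a function handle, so the same loop can use Euclidean, cityblock,
% correlation, DTW, or any other series-to-series distance.
function gIdx = kmeans_ts(X, k, distfun)
if nargin < 3
    distfun = @(a,b) sqrt(sum((a-b).^2));   % default: Euclidean distance
end
n = size(X,1);
c = X(randperm(n,k),:);          % k distinct series as initial centroids
g0 = ones(n,1); gIdx = zeros(n,1); D = zeros(n,k);
while any(g0~=gIdx)              % stop when the partition no longer changes
    g0 = gIdx;
    for t = 1:k
        for s = 1:n
            D(s,t) = distfun(X(s,:), c(t,:));
        end
    end
    [~,gIdx] = min(D,[],2);      % assign each series to its closest centroid
    for t = 1:k
        members = X(gIdx==t,:);
        if ~isempty(members)
            c(t,:) = mean(members,1);   % element-wise mean per time step
        end
    end
end
end

For example, gIdx = kmeans_ts(X, 3, @(a,b) sum(abs(a-b))) would cluster with the cityblock (L1) distance. Note, however, that keeping the element-wise mean as the centroid update is only really justified for (squared) Euclidean distance; with other metrics this becomes a k-means-like heuristic rather than true k-means, which is part of what the answer below is getting at.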
 

Solution

Time series are usually high-dimensional. And you need specialized distance functions to compare them for similarity. Plus, there might be outliers.

k-means is designed for low-dimensional spaces with a (meaningful) Euclidean distance. It is not very robust towards outliers, as it puts squared weight on them.

Using k-means on time series data doesn't sound like a good idea to me. Try looking into more modern, robust clustering algorithms. Many will allow you to use arbitrary distance functions, including time series distances such as DTW.
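
As an illustration of the kind of time series distance the answer refers to, here is a minimal, unoptimized dynamic-programming DTW between two 1-D series (an editorial sketch added for clarity, not part of the original answer; the name dtw_dist is made up):

% Classic DTW: cost of optimally warping series a onto series b.
% No warping-window constraint; runtime is O(length(a)*length(b)).
function d = dtw_dist(a, b)
na = numel(a); nb = numel(b);
C = inf(na+1, nb+1);
C(1,1) = 0;
for i = 1:na
    for j = 1:nb
        cost = abs(a(i) - b(j));
        C(i+1,j+1) = cost + min([C(i,j+1), C(i+1,j), C(i,j)]);
    end
end
d = C(na+1, nb+1);
end

A handle such as @dtw_dist could be plugged in as the distfun argument of the kmeans_ts sketch above, but since an element-wise mean is not a meaningful "center" under DTW, a medoid update (keeping as centroid the member series with the smallest total DTW distance to the rest of its cluster) is more consistent. That is essentially what k-medoids-style algorithms do, and it fits the answer's advice to look at clustering methods that accept arbitrary distance functions.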

