How can I use KNN / K-means to cluster time series in a dataframe?


Problem description


Suppose I have a dataframe that contains 1000 rows. Each row represents a time series.

I then built a DTW algorithm to calculate the distance between two rows.

I don't know what to do next to accomplish an unsupervised classification task on the dataframe.

How can I label all rows of the dataframe?

Solution

Definitions

KNN algorithm = K-nearest-neighbour classification algorithm

K-means = centroid-based clustering algorithm

DTW = Dynamic Time Warping, a similarity-measurement algorithm for time series

Below I show step by step how two time series can be built and how Dynamic Time Warping (DTW) can be computed between them. You can also build an unsupervised K-means clustering with scikit-learn without specifying the number of centroids. Note, however, that scikit-learn does not choose the number of clusters for you in that case: it falls back to its default of n_clusters=8. Its K-means also works with Euclidean distance, so it does not use the DTW distance directly.

Building the time-series and computing the DTW

You have two time series and you compute the DTW between them as follows:

import random

import numpy as np
import pandas as pd
# Assumes the PyPI package "dtw", whose dtw() returns
# (distance, cost matrix, accumulated cost matrix, path)
from dtw import dtw
from matplotlib.pyplot import cm, imshow, plot, show

from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer
# About classification, read the tutorial
# http://scikit-learn.org/stable/tutorial/basic/tutorial.html


def createTs(myStart, myLength):
    """Return an hourly series of myLength random values starting at myStart."""
    index = pd.date_range(myStart, periods=myLength, freq='H')
    values = [random.random() for _ in range(myLength)]
    return pd.Series(values, index=index)


# Time series of length 30, starting from 1/1/2000 and 1/2/2000, so they overlap
myStart = '1/1/2000'
myLength = 30
timeS1 = createTs(myStart, myLength)
myStart = '1/2/2000'
timeS2 = createTs(myStart, myLength)

# This could be your dataframe (one row per series), but it is unnecessary here
# myDF = pd.DataFrame([timeS1.values, timeS2.values])

# Column vectors, so that each DTW step compares one observation with another
x = np.sort(timeS1.values * 100).reshape(-1, 1)
y = timeS2.values.reshape(-1, 1)

choice = "dtw"

if choice == "timeseries":
    print(timeS1)
    print(timeS2)
elif choice == "drawingPlots":
    plot(x)
    plot(y)
    show()
elif choice == "dtw":
    # DTW with the 1st order norm as the pointwise distance
    dist, cost, acc, path = dtw(x, y, dist=lambda a, b: np.linalg.norm(a - b, ord=1))
    imshow(acc.T, origin='lower', cmap=cm.gray, interpolation='nearest')
    plot(path[0], path[1], 'w')
    show()
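
For readers who want to see what the DTW call above computes, here is a minimal hand-written sketch of the classic dynamic-programming recurrence, plus a helper that fills a distance matrix for every pair of rows. The names dtw_distance and pairwise_dtw are hypothetical and not part of any library, the local cost is assumed to be the absolute difference, and the result is the raw accumulated cost, so the values need not match a package that normalises its output.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic DTW between two 1-D sequences, using |a_i - b_j| as local cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # acc[i, j] = cheapest cost of aligning a[:i] with b[:j]
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]


def pairwise_dtw(rows):
    """Symmetric matrix of DTW distances between all rows of a 2-D array."""
    rows = np.asarray(rows, dtype=float)
    k = len(rows)
    dists = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dists[i, j] = dists[j, i] = dtw_distance(rows[i], rows[j])
    return dists


rng = np.random.default_rng(0)
series = rng.random((5, 30))  # 5 toy series standing in for the 1000 rows
D = pairwise_dtw(series)
print(D.shape)
```

The double loop costs O(n*m) per pair, so a full matrix for 1000 rows needs about half a million such computations. For real data a compiled DTW implementation is likely the better choice.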

Classification of the time-series with KNN

The question does not make clear what should be labelled, or with which labels. So please provide the following details:

  • What should we label in the data-frame? The path computed by DTW algorithm?
  • Which type of labeling? Binary? Multiclass?

Once these are settled, we can decide on a classification algorithm, which may well be the KNN algorithm. KNN is supervised: you need two separate data sets, a training set and a test set. With the training set you teach the algorithm to label the time series, while the test set lets you measure how well the model works, using model-selection metrics such as AUC.
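
The training-set / test-set workflow can be sketched as follows. This is only an illustration under stated assumptions: the labelled data is synthetic (noisy sines versus noisy ramps), the classifier uses scikit-learn's default Euclidean metric rather than DTW, and plain accuracy stands in for AUC. KNeighborsClassifier also accepts a callable through its metric parameter, which is where a DTW function could be plugged in.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 30)

# Hypothetical labelled data: class 0 = noisy sine waves, class 1 = noisy ramps
sines = np.sin(t) + 0.1 * rng.standard_normal((50, 30))
ramps = t / np.pi - 1 + 0.1 * rng.standard_normal((50, 30))
X = np.vstack([sines, ramps])
y = np.array([0] * 50 + [1] * 50)

# Hold out 30% of the labelled series to evaluate the fitted model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=3)  # default metric: Euclidean
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"test accuracy: {acc:.2f}")
```

The key point is that KNN cannot run without y: if the rows of the dataframe carry no labels, this approach does not apply and a clustering algorithm is needed instead.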

A small puzzle, left open until details about the question are provided:

#PUZZLE
#from tutorial (#http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
newX = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]]
newY = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]]
newY = MultiLabelBinarizer().fit_transform(newY)
#Continue to the article.

A scikit-learn article comparing classifiers is listed as the second item under Further reading.

Clustering with K-means (not the same as KNN)

K-means is a clustering algorithm, and it is unsupervised. You can use it like this:

# n_clusters is not set here, so scikit-learn uses its default of 8 clusters
myClusters = KMeans()
# myClusters.fit(YourDataHere)

This is a very different algorithm from KNN: here we do not need any labels. The first item under Further reading provides more material on the difference.
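
To connect this back to the question of labelling every row, here is a minimal sketch that assigns a cluster label to each row of a dataframe. The data is hypothetical (two groups of noisy curves), and the number of clusters is an assumption you have to supply yourself. Note that scikit-learn's KMeans clusters on Euclidean distance between rows, so this sketch does not use the DTW distance.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 30)

# Hypothetical dataframe: one time series per row, drawn from two shapes
rows = np.vstack([
    np.sin(t) + 0.1 * rng.standard_normal((20, 30)),
    np.cos(t) + 0.1 * rng.standard_normal((20, 30)),
])
df = pd.DataFrame(rows)

# n_clusters is chosen by hand; 2 matches how the toy data was generated
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(df.to_numpy())  # one integer label per row

# Keep the labels next to the series they describe
labeled = df.assign(cluster=labels)
print(labeled["cluster"].value_counts())
```

The integer labels are arbitrary cluster ids, not class names: which group ends up as 0 or 1 carries no meaning.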

Further reading

  1. Does K-means incorporate the K-nearest-neighbour algorithm?

  2. A comparison of classifiers in scikit-learn
