Scikit-learn labeled dataset creation from segmented time series


Problem description

I have a Pandas DataFrame that represents segmented time series of different users (i.e., user1 and user2). I want to train a scikit-learn classifier with this DataFrame, but I can't work out the shape of the scikit-learn dataset that I must create. Since my series are segmented, the DataFrame has a 'segID' column that contains the ID of the segment each row belongs to. I'll skip the description of the segmentation itself, since it is provided by an algorithm.

Let's take an example where both user1 and user2 have 2 segments. Output of print df:

        username  voltage        segID  
0       user1     -0.154732      0  
1       user1     -0.063169      0  
2       user1      0.554732      1  
3       user1     -0.641311      1  
4       user1     -0.653732      1  
5       user2      0.446469      0  
6       user2     -0.655732      0  
7       user2      0.646769      0  
8       user2     -0.646369      1  
9       user2      0.257732      1  
10      user2     -0.346369      1

QUESTIONS:

The scikit-learn dataset API says to create a dict containing data and target, but how should I shape my data, since they are segments and not just a list?

I can't figure out how my segments fit into the n_samples * n_features structure. I have two ideas:

1) Every data sample is a list representing a segment; target, on the other hand, is different for each data entry, since the rows are grouped. What about target_names? Could this work?

{
    'data': array([
        [[-0.154732, -0.063169]],
        [[ 0.554732, -0.641311, -0.653732]],
        [[ 0.446469, -0.655732, 0.646769]],
        [[-0.646369, 0.257732, -0.346369]]
        ]), 
    'target': 
        array([0, 1, 2, 3]),
    'target_names': array(['user1seg1', 'user1seg2', 'user2seg1', 'user2seg2'], dtype='|S10')
}

2) data is (simply) the NumPy array returned by df.values. target contains segment IDs that are different for each user. Does that make sense?

{
    'data': array([
        [-0.154732],
        [-0.063169],
        [ 0.554732],
        [-0.641311],
        [-0.653732],
        [ 0.446469],
        [-0.655732],
        [ 0.646769],
        [-0.646369],
        [ 0.257732],
        [-0.346369]
        ]), 
    'target': 
        array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]),
    'target_names': array(['user1seg1', 'user1seg1', 'user1seg2', 'user1seg2', .....], dtype='|S10')
}

I think the main problem is that I can't figure out what to use as labels...

OK, it's clear now: labels are given by my ground truth, they are just the users' names. elyase's answer is exactly what I was looking for. To state the problem better, I'll explain the meaning of segID here. In time series pattern recognition, segmentation can be useful for isolating meaningful segments. At testing time I want to recognize segments and not the entire series, because the series is rather long and the segments are supposed to be meaningful in my context.

Have a look at the following example from this implementation based on "An Online Algorithm for Segmenting Time Series". My segID is just a column representing the ID of a chunk.

Recommended answer

This is not trivial, and there may be several ways of formulating the problem for consumption by an ML algorithm. You should try them all and see which gives the best results.

As you already found, you need two things: a matrix X of shape n_samples * n_features and a column vector y of length n_samples. Let's start with the target y.

Target:

As you want to predict a user from a discrete pool of usernames, you have a classification problem, and your target will be a vector with np.unique(y) == ['user1', 'user2', ...]
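
As a rough sketch of what such a target vector could look like, the snippet below rebuilds the question's DataFrame by hand and derives one label per segment. Treating each (username, segID) segment as one sample is an assumption taken from the question's follow-up, not something the answer prescribes, and the `segments` and `length` names are only illustrative.

```python
import numpy as np
import pandas as pd

# The DataFrame from the question, rebuilt by hand.
df = pd.DataFrame({
    "username": ["user1"] * 5 + ["user2"] * 6,
    "voltage": [-0.154732, -0.063169, 0.554732, -0.641311, -0.653732,
                0.446469, -0.655732, 0.646769, -0.646369, 0.257732, -0.346369],
    "segID": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# Assumption: one sample per (username, segID) segment, labeled with its user.
segments = (
    df.groupby(["username", "segID"], sort=True)
      .size()
      .reset_index(name="length")  # "length" = number of rows in the segment
)
y = segments["username"].to_numpy()

print(y)             # one label per segment
print(np.unique(y))  # the discrete pool of classes
```

Note that the label is the username only; the segment ID identifies a sample, it is not part of the class.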

Features:

Your features are the information that you provide to the ML algorithm for each label/user/target. Unfortunately, most algorithms require this information to have a fixed length, and variable-length time series don't fit well into this description. So if you want to stick to classic algorithms, you need some way to condense the time series information for a user into a fixed-length vector. Some possibilities are the mean, min, max, sum, first and last values, histogram, spectral power, etc. You will need to come up with the ones that make sense for your given problem.
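
To make the "fixed-length vector" idea concrete, here is a minimal sketch of a summary function. The name `summarize`, the choice of 4 histogram bins, and the value range of (-1, 1) are all assumptions made for illustration; spectral power is left out to keep it short.

```python
import numpy as np

def summarize(series, bins=4, value_range=(-1.0, 1.0)):
    """Condense a variable-length 1-D series into a fixed-length feature vector."""
    x = np.asarray(series, dtype=float)
    # Simple statistics: min, max, mean, sum, first value, last value.
    stats = [x.min(), x.max(), x.mean(), x.sum(), x[0], x[-1]]
    # Fixed bin edges, so every series yields the same number of counts.
    hist, _ = np.histogram(x, bins=bins, range=value_range)
    # Divide counts by the series length so short and long series are comparable.
    return np.concatenate([stats, hist / len(x)])

short = summarize([-0.154732, -0.063169])             # 2 points
long_ = summarize([0.554732, -0.641311, -0.653732])   # 3 points
print(short.shape, long_.shape)  # both vectors have 6 + bins entries
```

Whatever the input length, the output always has 6 + bins entries, which is what lets the rows be stacked into an n_samples * n_features matrix.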

So if you ignore the segID information, your X matrix will look like this:

y/features 
           min max ... sum 
user1      0.1 1.2 ... 1.1    # <-first time series for user 1
user1      0.0 1.3 ... 1.1    # <-second time series for user 1
user2      0.3 0.4 ... 13.0   # <-first time series for user 2
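
A matrix of this kind can be sketched with a pandas groupby. The sketch below assumes each (username, segID) segment of the question's DataFrame is one sample, so segID only delimits the samples and is not used as a feature; the choice of DecisionTreeClassifier is arbitrary and stands in for any scikit-learn classifier.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The DataFrame from the question, rebuilt by hand.
df = pd.DataFrame({
    "username": ["user1"] * 5 + ["user2"] * 6,
    "voltage": [-0.154732, -0.063169, 0.554732, -0.641311, -0.653732,
                0.446469, -0.655732, 0.646769, -0.646369, 0.257732, -0.346369],
    "segID": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# One row per segment; the columns are fixed-length summaries of the voltage.
feature_names = ["min", "max", "mean", "sum"]
features = (
    df.groupby(["username", "segID"])["voltage"]
      .agg(feature_names)
      .reset_index()
)
X = features[feature_names].to_numpy()  # shape (n_samples, n_features)
y = features["username"].to_numpy()     # shape (n_samples,)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(X.shape, y.shape)
print(clf.predict(X[:1]))  # prediction for the first training segment
```

Four samples are far too few to learn anything meaningful; the point is only the shapes of X and y. With real data you would hold out segments for testing instead of predicting on the training rows.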

As segID is itself a time series, you also need to encode it as fixed-length information, for example a histogram/counts of all possible values, the most frequent value, etc.

In this case you will have:

y/features 
           min max ... sum segID_most_freq segID_min
user1      0.1 1.2 ... 1.1 1               1
user1      0.3 0.4 ... 13  2               1
user2      0.3 0.4 ... 13  5               3

The algorithm will look at this data and "think": for user1 the minimum segID is always 1, so if at prediction time I see a user whose time series has a minimum segID of 1, it should be user1. If it is around 3, it is probably user2, and so on.
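
One way to sketch this encoding is shown below. It assumes each user's rows in the question's DataFrame form a single series, so grouping by username gives one sample per series; with several series per user you would group by a series identifier instead. The `n_segments` column is an extra illustrative feature, not one from the table above.

```python
import pandas as pd

# The DataFrame from the question, rebuilt by hand.
df = pd.DataFrame({
    "username": ["user1"] * 5 + ["user2"] * 6,
    "voltage": [-0.154732, -0.063169, 0.554732, -0.641311, -0.653732,
                0.446469, -0.655732, 0.646769, -0.646369, 0.257732, -0.346369],
    "segID": [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

grouped = df.groupby("username")

# Fixed-length summaries of the voltage series.
volt_features = grouped["voltage"].agg(["min", "max", "sum"])

# Fixed-length summaries of the segID series.
seg_features = grouped["segID"].agg(
    segID_most_freq=lambda s: s.mode().iloc[0],  # ties resolve to the smallest value
    segID_min="min",
    n_segments="nunique",
)

X = volt_features.join(seg_features)  # one row per series, indexed by username
y = X.index.to_numpy()
print(X)
```

In this toy data the segID columns barely differ between the two users, so they would carry little signal; they only help when segment IDs actually differ between users in the way the answer describes.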

Keep in mind that this is only one possible approach. Sometimes it is useful to ask: what information will I have at prediction time that will allow me to find out which user I am seeing, and why would this information lead to the given user?
