LSTM on sequential data, predicting a discrete column


Problem description

I am new to ML and only scratching its surface, so I apologize if my question makes no sense.

I have a sequence of continuous measurements for some object (capturing its weight, size, temperature, ...) and a discrete column determining a property of the object (a finite range of integers, say 0, 1, 2). This is the column that I would like to predict.

The data in question is indeed a sequence, since the value of the property column may vary depending on the context surrounding it, and there may also be some cyclical properties to the sequence itself. In short: the order of the data matters to me.

A small example is shown in the table below.
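(The original table is not reproduced on this page; the values below are reconstructed from the DataFrame used in the answer.)

Temperature   Weight   Size   Property
183           8        3.97   0
10.7          11.2     7.88   1
24.3          14       11     2
10.7          11.2     7.88   0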

Note that there are two rows containing equal data yet having a different value in the Property field. The idea is that the value of the Property field may depend on the previous rows, and hence the order of the rows is important.

My question is, what kind of approach/tools/techniques should I use to tackle this problem?

I am aware of classification algorithms, but somehow I don't think they apply here, given that the data in question is sequential and I wouldn't want to ignore this property.

I tried using a Keras LSTM and pretending the Property column is continuous as well. However, the predictions I obtain this way are usually just a constant decimal value that makes no sense in this context.
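For reference, treating Property as a categorical target in Keras means a softmax output and a cross-entropy loss rather than a regression head. A minimal, illustrative sketch (not from the original post; shapes and placeholder data are made up):

import numpy as np
from tensorflow import keras

time_steps, n_features, n_classes = 1, 3, 3
X = np.random.rand(100, time_steps, n_features)      # placeholder feature windows
y = np.random.randint(0, n_classes, size=(100,))     # placeholder integer labels

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(time_steps, n_features)),
    keras.layers.Dense(n_classes, activation='softmax'),   # one probability per class
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=8, verbose=0)
pred_classes = model.predict(X).argmax(axis=1)            # integer class predictions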

What would be the best way to tackle this type of problem?

Answer

import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Temperature': [183, 10.7, 24.3, 10.7],
                   'Weight': [8, 11.2, 14, 11.2],
                   'Size': [3.97, 7.88, 11, 7.88],
                   'Property': [0,1,2,0]})

# print first 5 rows
df.head()

# adjust target(t) to depend on input (t-1)
df.Property = df.Property.shift(-1)

# parameters
time_steps = 1
inputs = 3
outputs = 1

# remove nans as a result of the shifted values
df = df.iloc[:-1,:]

# convert to numpy array
df = df.values

Data preprocessing

# center and scale
scaler = MinMaxScaler(feature_range=(0, 1))    
df = scaler.fit_transform(df)

# X_y_split: Property is the target. Note that this indexing assumes Property is
# column 0 (older pandas sorted the dict keys alphabetically: Property, Size,
# Temperature, Weight); with newer pandas the columns keep insertion order and
# Property would instead be the last column, so the split would need adjusting.
train_X = df[:, 1:]
train_y = df[:, 0]

# reshape input to 3D array
train_X = train_X[:,None,:]

# reshape output to 1D array
train_y = np.reshape(train_y, (-1,outputs))

Model parameters

learning_rate = 0.001
epochs = 500
batch_size = int(train_X.shape[0]/2)
length = train_X.shape[0]
display = 100
neurons = 100

# clear graph (if any) before running
tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, time_steps, inputs])
y = tf.placeholder(tf.float32, [None, outputs])

# LSTM Cell
cell = tf.contrib.rnn.BasicLSTMCell(num_units=neurons, activation=tf.nn.relu)
cell_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

# pass into Dense layer
stacked_outputs = tf.reshape(cell_outputs, [-1, neurons])
out = tf.layers.dense(inputs=stacked_outputs, units=outputs)

# squared error loss or cost function for linear regression
loss = tf.losses.mean_squared_error(labels=y, predictions=out)
# optimizer to minimize cost
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

Execute in a Session

with tf.Session() as sess:
    # initialize all variables
    tf.global_variables_initializer().run()

    # Train the model
    for steps in range(epochs):
        mini_batch = zip(range(0, length, batch_size),
                   range(batch_size, length+1, batch_size))

        # train data in mini-batches
        for (start, end) in mini_batch:
            sess.run(training_op, feed_dict = {X: train_X[start:end,:,:],
                                               y: train_y[start:end,:]})

        # print training performance 
        if (steps+1) % display == 0:
            # evaluate loss function on training set
            loss_fn = loss.eval(feed_dict = {X: train_X, y: train_y})
            print('Step: {}  \tTraining loss (mse): {}'.format((steps+1), loss_fn))

    # Test model
    y_pred = sess.run(out, feed_dict={X: train_X})

    plt.title("LSTM RNN Model", fontsize=12)
    plt.plot(train_y, "b--", markersize=10, label="targets")
    plt.plot(y_pred, "k--", markersize=10, label=" prediction")
    plt.legend()
    plt.xlabel("Period")

'Output':
Step: 100       Training loss (mse): 0.15871836245059967
Step: 200       Training loss (mse): 0.03062588907778263
Step: 300       Training loss (mse): 0.0003023963945452124
Step: 400       Training loss (mse): 1.7712079625198385e-07
Step: 500       Training loss (mse): 8.750407516633363e-12
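Note that y_pred lives in the MinMax-scaled space, because the whole frame (including Property) was scaled before the split. A sketch (not part of the original answer) of mapping the predictions back to the original Property scale, under the column-order assumption noted above:

# rebuild full-width rows so the fitted scaler can invert them (illustrative only)
scaled_full = np.concatenate([y_pred, train_X[:, 0, :]], axis=1)  # [Property, features]
y_pred_property = scaler.inverse_transform(scaled_full)[:, 0]     # Property in original units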

Assumptions

  • I assumed that the target Property is the output for the sequence of inputs after 1 time step.
  • If this is not the case, the sequence format of the data input/output can easily be remodeled to fit the problem use-case more correctly (see the sketch after this list). I think the general idea here is to show how to address the multi-variate time-series prediction sequence problem with tensorflow.
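For instance, a sketch of such remodeling, using a hypothetical make_windows helper and a lookback of 2 (neither appears in the original answer):

# build inputs of shape (samples, lookback, n_features) and targets aligned so that
# Property at time t is predicted from the previous `lookback` rows (illustrative only)
def make_windows(features, target, lookback=2):
    X, y = [], []
    for t in range(lookback, len(features)):
        X.append(features[t - lookback:t])   # the previous `lookback` rows of measurements
        y.append(target[t])                  # the Property value to predict
    return np.array(X), np.array(y)

# e.g. win_X, win_y = make_windows(train_X.reshape(len(train_X), -1), train_y, lookback=2)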

The code below models the use-case as a classification problem, where the RNN attempts to predict the class membership of a particular input sequence.

Again, I make the assumption that the target (t) depends on the input sequence at (t-1).

import tensorflow as tf
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({'Temperature': [183, 10.7, 24.3, 10.7],
                   'Weight': [8, 11.2, 14, 11.2],
                   'Size': [3.97, 7.88, 11, 7.88],
                   'Property': [0,1,2,0]})

# print first 5 rows
df.head()

# adjust target(t) to depend on input (t-1)
df.Property = df.Property.shift(-1)

# parameters
time_steps = 1
inputs = 3
outputs = 3

# remove nans as a result of the shifted values
df = df.iloc[:-1,:]

# convert to numpy
df = df.values

Data preprocessing

# X_y_split (same column-order caveat as in the regression example above)
train_X = df[:, 1:]
train_y = df[:, 0]

# center and scale
scaler = MinMaxScaler(feature_range=(0, 1))    
train_X = scaler.fit_transform(train_X)

# reshape input to 3D array
train_X = train_X[:,None,:]

# one-hot encode the outputs
onehot_encoder = OneHotEncoder()
encode_categorical = train_y.reshape(len(train_y), 1)
train_y = onehot_encoder.fit_transform(encode_categorical).toarray()

Model parameters

learning_rate = 0.001
epochs = 500
batch_size = int(train_X.shape[0]/2)
length = train_X.shape[0]
display = 100
neurons = 100

# clear graph (if any) before running
tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, time_steps, inputs])
y = tf.placeholder(tf.float32, [None, outputs])

# LSTM Cell
cell = tf.contrib.rnn.BasicLSTMCell(num_units=neurons, activation=tf.nn.relu)
cell_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

# pass into Dense layer
stacked_outputs = tf.reshape(cell_outputs, [-1, neurons])
out = tf.layers.dense(inputs=stacked_outputs, units=outputs)

# softmax cross-entropy loss for multi-class classification
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
        labels=y, logits=out))

# optimizer to minimize cost
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

Define classification evaluation metrics

accuracy = tf.metrics.accuracy(labels =  tf.argmax(y, 1),
                          predictions = tf.argmax(out, 1),
                          name = "accuracy")
precision = tf.metrics.precision(labels=tf.argmax(y, 1),
                                 predictions=tf.argmax(out, 1),
                                 name="precision")
recall = tf.metrics.recall(labels=tf.argmax(y, 1),
                           predictions=tf.argmax(out, 1),
                           name="recall")
# F1 score is the harmonic mean of precision and recall
f1 = 2 * precision[1] * recall[1] / (precision[1] + recall[1])

Execute in a Session

with tf.Session() as sess:
    # initialize all variables
    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run()

    # Train the model
    for steps in range(epochs):
        mini_batch = zip(range(0, length, batch_size),
                   range(batch_size, length+1, batch_size))

        # train data in mini-batches
        for (start, end) in mini_batch:
            sess.run(training_op, feed_dict = {X: train_X[start:end,:,:],
                                               y: train_y[start:end,:]})

        # print training performance 
        if (steps+1) % display == 0:
            # evaluate loss function on training set
            loss_fn = loss.eval(feed_dict = {X: train_X, y: train_y})
            print('Step: {}  \tTraining loss: {}'.format((steps+1), loss_fn))

    # evaluate model accuracy
    acc, prec, recall, f1 = sess.run([accuracy, precision, recall, f1],
                                     feed_dict = {X: train_X, y: train_y})

    print('\nEvaluation  on training set')
    print('Accuracy:', acc[1])
    print('Precision:', prec[1])
    print('Recall:', recall[1])
    print('F1 score:', f1)

'Output':

Step: 100       Training loss: 0.5373622179031372
Step: 200       Training loss: 0.33380019664764404
Step: 300       Training loss: 0.176949605345726
Step: 400       Training loss: 0.0781424418091774
Step: 500       Training loss: 0.0373661033809185

Evaluation  on training set
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
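If the predicted class labels themselves are wanted (the original answer only reports the metrics), a small sketch along these lines could be run inside the same Session block after training:

# sketch (not in the original answer): recover integer class predictions
import numpy as np
probs = sess.run(tf.nn.softmax(out), feed_dict={X: train_X})   # class probabilities
pred_classes = np.argmax(probs, axis=1)                        # predicted Property labels
true_classes = np.argmax(train_y, axis=1)                      # decode the one-hot targets
print('Predicted:', pred_classes, 'Actual:', true_classes)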

