如何编码输入长度可变的数据? [英] How to Encode Data of Variable Input Length?

查看:54
本文介绍了如何编码输入长度可变的数据?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当我被这个问题困扰时,我正在做一些数据科学工作,我正在尝试创建一个监督任务的模型,其中输入和输出的长度都是可变的.

I was doing some data science work when I get stuck with this issue, I'm trying to create a model for a supervised task, where both Input and Output are of variable length.

以下是有关输入和输出外观的示例:

Here is an example on how the Input and Output look like:

Input[0]: [2.42, 0.43, -5.2, -54.9]
Output[0]: ['class1', 'class3', 'class12']

问题在于,在我拥有的数据集中,输入和输出的长度是可变的,因此,为了能够使用整个数据集,我需要找到一种编码此数据的方法.

The problem is that in the dataset that I have, the Inputs and Outputs are of variable length, so in order to be able to use the entire dataset, I need to find a way of encoding this data.

首先我对输出类进行编码并添加一个填充以等于数据集中的所有输出长度,(比如 length=(6,)):

First I encoded the Outputs classes and added a padding to equal all Outputs length in the dataset, (let's say of length=(6,)):

Encoded_output[0]: [1, 3, 12, 0, 0, 0]

但是我无法找到一种编码Input的方法,因为由于原始Input数据是浮点的,所以我无法创建编码并添加填充.我不知道我还有其他选择,我想听听您如何解决这个问题.

But I can't figure out a way of encoding the Input, because, as the original Input data are floats, I cannot create an encoding and add padding. I don't know what other options I have and I would like to hear how would you solve this.

推荐答案

一种解决方法是:

  1. 找出可变数据可以达到的最大长度.
  2. 找出每个训练实例的真实长度.

从这两件事中,您可以创建一个蒙版,并让您的网络为您要忽略的内容计算零梯度.

From these two things you can create a mask and have your network compute zero gradients for the stuff you want to ignore.

示例:

import tensorflow as tf

# pretend the longest data instance we will have is 5
MAX_SEQ_LEN = 5

# batch of indices indicating true length of each instance
idxs = tf.constant([3, 2, 1])

# batch of variable-length data
rt = tf.ragged.constant(
  [
    [0.234, 0.123, 0.654],
    [0.111, 0.222],
    [0.777],
  ], dtype=tf.float32)

t = rt.to_tensor(shape=[3, MAX_SEQ_LEN])
print(t)
# tf.Tensor(
# [[0.234 0.123 0.654 0.    0.   ]
#  [0.111 0.222 0.    0.    0.   ]
#  [0.777 0.    0.    0.    0.   ]], shape=(3, 5), dtype=float32)

# use indices to create a boolean mask. We can use this mask in
# layers in our network to ignore gradients
mask = tf.sequence_mask(idxs, MAX_SEQ_LEN)
print(mask)
# <tf.Tensor: shape=(3, 5), dtype=bool, numpy=
# array([[ True,  True,  True, False, False],
#        [ True,  True, False, False, False],
#        [ True, False, False, False, False]])>

此用例通常出现在 RNN 中.您可以在 call()方法中看到一个 mask 选项,您可以在其中为可变长度的时间序列数据传递二进制掩码.

This use case commonly occurs in RNNs. You can see there is a mask option in the call() method where you can pass a binary mask for variable length time-series data.

这篇关于如何编码输入长度可变的数据?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆