How to zero-pad on both sides and one-hot encode a sequence in Keras?

Question

My text data is as follows.

import numpy as np

X_train_orignal= np.array(['OC(=O)C1=C(Cl)C=CC=C1Cl', 'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl',
       'OC(=O)C1=CC=CC(=C1Cl)Cl', 'OC(=O)C1=CC(=CC=C1Cl)Cl',
       'OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O'])

As is evident, the sequences have different lengths. How can I zero-pad each sequence on both sides up to some maximum length, and then convert each sequence into a one-hot encoding, character by character?

Attempt:

I tried the following Keras API, but it does not work with sequences of strings.

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)

I might need to convert my sequence data into one-hot vectors first and then zero-pad it. For that, I tried to use Tokenizer as follows.

tk = Tokenizer(nb_words=?, split=?)

But then, what should the split value and nb_words be, given that my sequence data doesn't contain any spaces? How can I use it for character-based one-hot encoding?

My overall goal is to zero-pad my sequences and convert them to one-hot before I feed them into an RNN.

Recommended answer

So I came across a way to do this: use Tokenizer first, then pad_sequences to zero-pad my sequences at the start, as follows.

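import keras
from keras.preprocessing.text import Tokenizer

# char_level=True treats every character as a token, so no split value is needed
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X_train_orignal)

# Map each string to a list of integer character indices (indices start at 1)
sequence_of_int = tokenizer.texts_to_sequences(X_train_orignal)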

This gave me the following output.

[[3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 4, 1, 7, 5, 1, 2, 1, 1, 2, 1, 6, 1, 7],
 [3,
  1,
  4,
  2,
  3,
  5,
  1,
  6,
  2,
  1,
  4,
  1,
  7,
  5,
  1,
  2,
  1,
  4,
  1,
  7,
  5,
  1,
  2,
  1,
  6,
  1,
  7],
 [3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 2, 1, 1, 4, 2, 1, 6, 1, 7, 5, 1, 7],
 [3, 1, 4, 2, 3, 5, 1, 6, 2, 1, 1, 4, 2, 1, 1, 2, 1, 6, 1, 7, 5, 1, 7],
 [3,
  1,
  6,
  2,
  1,
  4,
  1,
  2,
  1,
  4,
  1,
  2,
  1,
  6,
  5,
  8,
  10,
  11,
  9,
  4,
  8,
  3,
  12,
  9,
  5,
  2,
  3,
  5,
  8,
  10,
  11,
  9,
  4,
  8,
  3,
  12,
  9,
  5,
  2,
  3]]
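
The integer IDs above can be reproduced without Keras, which makes the mapping easier to see. The following is a minimal sketch, assuming the tokenizer lowercases its input (the `lower=True` default) and numbers characters from 1 in order of descending frequency, with ties kept in first-seen order. `build_char_index` and `encode` are hypothetical helper names used only for illustration, not part of Keras.

```python
from collections import Counter

texts = [
    'OC(=O)C1=C(Cl)C=CC=C1Cl',
    'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl',
    'OC(=O)C1=CC=CC(=C1Cl)Cl',
    'OC(=O)C1=CC(=CC=C1Cl)Cl',
    'OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O',
]


def build_char_index(texts, lower=True):
    """Hypothetical helper: map each character to an integer ID starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower() if lower else text)
    # most_common() sorts by descending count; ties keep first-seen order.
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


def encode(text, char_index, lower=True):
    """Hypothetical helper: turn one string into a list of integer IDs."""
    if lower:
        text = text.lower()
    return [char_index[ch] for ch in text]


char_index = build_char_index(texts)
encoded = [encode(t, char_index) for t in texts]
print(char_index)
print(encoded[0])
```

Index 0 is never assigned to a character, so it stays free for padding. Because of the lowercasing, `C` and `c` share one ID; for SMILES strings in which case matters (lowercase atom symbols denote aromatic atoms), passing `lower=False` to Tokenizer may be preferable.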

Now, I do not understand why sequence_of_int[1] and sequence_of_int[4] are printed in column format.
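
The column layout is most likely a display artifact rather than a difference in the data. Assuming the result was shown by an interactive shell's pretty-printer (IPython or Jupyter, which is an assumption here), a list whose one-line form exceeds the line width is printed one element per line. Python's standard `pprint` module behaves the same way with its default width of 80, as this small sketch shows:

```python
from pprint import pformat

# Stand-ins for the tokenized sequences: only the lengths matter here.
short_seq = [1] * 23  # one-line repr is 69 characters wide
long_seq = [1] * 27   # one-line repr is 81 characters wide

short_text = pformat(short_seq)  # fits within 80 columns, stays on one line
long_text = pformat(long_seq)    # too wide, so each element gets its own line

print(short_text)
print(long_text)
```

This matches the output above: the 23-token sequences fit on one line, while the 27-token and 40-token sequences do not. The underlying lists are ordinary Python lists either way.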

After getting the tokens, I applied pad_sequences as follows.

seq=keras.preprocessing.sequence.pad_sequences(sequence_of_int, maxlen=None, dtype='int32', padding='pre', value=0.0)

It gave me the following output.

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  3,  1,  4,  2,  3,  5,  1,  6,  2,  1,  4,  1,  7,  5,  1,
         2,  1,  1,  2,  1,  6,  1,  7],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  3,  1,  4,
         2,  3,  5,  1,  6,  2,  1,  4,  1,  7,  5,  1,  2,  1,  4,  1,
         7,  5,  1,  2,  1,  6,  1,  7],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  3,  1,  4,  2,  3,  5,  1,  6,  2,  1,  1,  2,  1,  1,  4,
         2,  1,  6,  1,  7,  5,  1,  7],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  3,  1,  4,  2,  3,  5,  1,  6,  2,  1,  1,  4,  2,  1,  1,
         2,  1,  6,  1,  7,  5,  1,  7],
       [ 3,  1,  6,  2,  1,  4,  1,  2,  1,  4,  1,  2,  1,  6,  5,  8,
        10, 11,  9,  4,  8,  3, 12,  9,  5,  2,  3,  5,  8, 10, 11,  9,
         4,  8,  3, 12,  9,  5,  2,  3]], dtype=int32)

After that, I converted it into one-hot as follows.

one_hot=keras.utils.to_categorical(seq)
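
Note that the steps above pad only at the start, since `pad_sequences` accepts `padding='pre'` or `padding='post'` but has no option for both sides. For the two-sided padding asked about in the question, a minimal NumPy sketch could look like the following. `pad_both_sides` and `one_hot` are hypothetical helpers, not Keras functions, and putting the extra zero on the right when the gap is odd is an arbitrary choice.

```python
import numpy as np


def pad_both_sides(sequences, maxlen=None, value=0):
    """Hypothetical helper: center each integer sequence in a row of length maxlen."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        seq = list(seq)[:maxlen]            # assumption: drop the tail if too long
        left = (maxlen - len(seq)) // 2     # any odd leftover goes to the right
        padded[row, left:left + len(seq)] = seq
    return padded


def one_hot(padded, num_classes, pad_as_zero_vector=False):
    """Hypothetical helper: one-hot encode integer IDs by indexing an identity matrix."""
    table = np.eye(num_classes, dtype=np.float32)
    if pad_as_zero_vector:
        table[0] = 0.0                      # padding (ID 0) becomes an all-zero vector
    return table[padded]


sequences = [[3, 1, 4], [3, 1, 4, 2, 3], [2]]
padded = pad_both_sides(sequences)
encoded = one_hot(padded, num_classes=5)
masked = one_hot(padded, num_classes=5, pad_as_zero_vector=True)

print(padded)
print(encoded.shape)
```

With the default arguments, `one_hot` treats padding as class 0, so padded positions get a 1 in column 0; this is also what `keras.utils.to_categorical(seq)` produces. Setting `pad_as_zero_vector=True` yields all-zero vectors at padded positions instead, which may be closer to what "zero padding" means for an RNN input.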
