Creating sequence vector from text in Python

Question

I am trying to prepare the input data for an LSTM-based NN. I have a large number of text documents, and I want to build a sequence vector for each document so that I can feed them as training data to an LSTM RNN.

My poor approach:

import re
import numpy as np
#raw data
train_docs = ['this is text number one', 'another text that i have']

#put all docs together
train_data = ''
for val in train_docs:
    train_data += ' ' + val

tokens = np.unique(re.findall('[a-zа-я0-9]+', train_data.lower()))
voc = {v: k for k, v in dict(enumerate(tokens)).items()}

and then brute-force replace the words of each doc with their indices from the "voc" dict.
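The replacement step described above could be sketched roughly as follows. This is only an illustration of the manual approach, not a recommended solution: the `tokenize` helper and the `TOKEN_RE` name are hypothetical, and starting the indices at 1 (rather than 0, as `enumerate` does by default) is an assumption made here so that 0 stays free for a padding value.

```python
import re

# Raw data (the same two example documents as above).
train_docs = ['this is text number one', 'another text that i have']

# Same token pattern as above: Latin letters, Cyrillic letters and digits.
TOKEN_RE = re.compile(r'[a-zа-я0-9]+')


def tokenize(doc):
    """Lowercase a document and split it into tokens (hypothetical helper)."""
    return TOKEN_RE.findall(doc.lower())


# Vocabulary built from the sorted set of unique tokens.
# Assumption: indices start at 1 so that 0 can later be used for padding.
tokens = sorted({tok for doc in train_docs for tok in tokenize(doc)})
voc = {tok: idx for idx, tok in enumerate(tokens, start=1)}

# Replace every token of every document with its vocabulary index.
sequences = [[voc[tok] for tok in tokenize(doc)] for doc in train_docs]
print(sequences)
```

Each document becomes a list of integers with one entry per token, which is the shape of data the question is after.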

Are there any libraries that can help with this task?

Recommended answer

Solved with the Keras text preprocessing classes: http://keras.io/preprocessing/text/

I did it like this:

from keras.preprocessing.text import Tokenizer, text_to_word_sequence

train_docs = ['this is text number one', 'another text that i have']
tknzr = Tokenizer(lower=True, split=" ")
tknzr.fit_on_texts(train_docs)
#vocabulary:
print(tknzr.word_index)

Out[1]:
{'this': 2, 'is': 3, 'one': 4, 'another': 9, 'i': 5, 'that': 6, 'text': 1, 'number': 8, 'have': 7}

#making sequences:
X_train = tknzr.texts_to_sequences(train_docs)
print(X_train)

Out[2]:
[[2, 3, 1, 8, 4], [9, 1, 6, 5, 7]]
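To make the numbers above easier to read: the indices start at 1, and the most frequent word ("text", which occurs in both documents) gets index 1. The following is a dependency-free sketch of that kind of frequency-ranked mapping using only the standard library. It is an approximation for illustration, not the Keras implementation, and the order among equally frequent words may differ from the output shown above.

```python
from collections import Counter

train_docs = ['this is text number one', 'another text that i have']

# Count how often each lowercased, space-separated word occurs.
counts = Counter(
    word for doc in train_docs for word in doc.lower().split(' ')
)

# Rank words by descending frequency; indices start at 1.
# Words with equal counts keep their first-seen order here, which is
# an assumption of this sketch and may not match Keras in every version.
word_index = {
    word: rank
    for rank, (word, _count) in enumerate(counts.most_common(), start=1)
}

# Map every document to its sequence of word indices.
sequences = [
    [word_index[word] for word in doc.lower().split(' ')]
    for doc in train_docs
]
print(word_index)
print(sequences)
```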
