Pybrain Text Classification: data and input


Question

I have 3 sets of sentences (varying in word count), but I don't know how to extract features from the text so that the input dimension stays the same.

For example, I've tried bag-of-words, but since the word count varies, the input dimension varies too, and I end up with errors.

I would much appreciate it if you could show me an approach to preparing string data for the neural network.

Thanks!

(Python 2.7 on Windows 7)

Recommended answer

How to format the input

This is an excerpt from wikipedia.org. Take the following two text documents:


John likes to watch movies. Mary likes too.

John also likes to watch football games.

Based on these two text documents, a dictionary is constructed:

{
    "John": 1,
    "likes": 2,
    "to": 3,
    "watch": 4,
    "movies": 5,
    "also": 6,
    "football": 7,
    "games": 8,
    "Mary": 9,
    "too": 10
}

This dictionary has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector, where each entry counts how often the corresponding word occurs in the document:

[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
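
The counting described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not PyBrain code: the helper names `tokenize`, `build_vocabulary` and `vectorize` are made up for this example, and the tokenizer (letters only, case-sensitive) is an assumption chosen so that the result matches the vectors shown above.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a word is a run of ASCII letters; case is preserved,
    # so "John" and "john" would count as different words.
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(words):
    # Map each word to a fixed 0-based position in the feature vector.
    # (The dictionary in the answer numbers the same words from 1.)
    return dict((word, index) for index, word in enumerate(words))


def vectorize(text, vocabulary):
    # Count every token, then read off the counts in vocabulary order.
    # Words outside the vocabulary are ignored, so the vector length
    # always equals len(vocabulary), whatever the length of the text.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector


# The ten words from the dictionary above, in the same order.
vocabulary = build_vocabulary([
    "John", "likes", "to", "watch", "movies",
    "also", "football", "games", "Mary", "too",
])

doc1 = "John likes to watch movies. Mary likes too."
doc2 = "John also likes to watch football games."

print(vectorize(doc1, vocabulary))
print(vectorize(doc2, vocabulary))
```

In practice the vocabulary would be built once from all training sentences and then reused unchanged for every new sentence. The length of the vocabulary is then the input dimension of the network.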


Regardless of the length of a document, the dimension of your input stays the same. Hope this helps.
