One-Hot Encoding for representing corpus sentences in Python


Question

I am a beginner with Python and the scikit-learn library. I am working on an NLP project in which I first need to represent a large corpus using one-hot encoding. I have read scikit-learn's documentation for preprocessing.OneHotEncoder, but it does not seem to match my understanding of the term.

Basically, the idea is as follows:

  • 1000000 Sunday; 0100000 Monday; 0010000 Tuesday; ... 0000001 Saturday;

If the corpus has only 7 distinct words, then I only need a 7-digit vector to represent each word. A complete sentence can then be represented by combining the vectors of all its words, which gives a sentence matrix. However, when I tried this in Python, it did not seem to work...

How can I work this out? My corpus contains a very large number of distinct words.

By the way, since the vectors are mostly filled with zeros, it seems we could use scipy.sparse to reduce the storage, for example with the CSR format.

Hence, my whole question is:

How can the sentences in a corpus be represented with OneHotEncoder and stored in a sparse matrix?

Thank you all.

Recommended answer

To use the OneHotEncoder, you can split your documents into tokens and then map every token to an id (which is always the same for the same string). Then apply the OneHotEncoder to the resulting lists of ids. By default, the result is a sparse matrix.

Example code for two simple documents, A B and B B:

from sklearn.preprocessing import OneHotEncoder
import itertools

# two example documents
docs = ["A B", "B B"]

# split documents to tokens
tokens_docs = [doc.split(" ") for doc in docs]

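# flatten the list of token lists into one list of tokens, then build a
# dictionary that maps each word to an integer id, here {"A": 0, "B": 1};
# sorting the vocabulary makes the ids reproducible between runs
all_tokens = itertools.chain.from_iterable(tokens_docs)
word_to_id = {token: idx for idx, token in enumerate(sorted(set(all_tokens)))}

# convert token lists to token-id lists, here [[0, 1], [1, 1]]
token_ids = [[word_to_id[token] for token in tokens_doc] for tokens_doc in tokens_docs]

# convert the token-id lists to a one-hot representation; every document must
# have the same number of tokens, because each position is encoded as one feature.
# `categories` takes the place of the `n_values` argument of older scikit-learn releases.
n_positions = len(token_ids[0])
vec = OneHotEncoder(categories=[list(range(len(word_to_id)))] * n_positions)
X = vec.fit_transform(token_ids)

print(X.toarray())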

This prints the one-hot vectors of each document in concatenated form:

[[ 1.  0.  0.  1.]
 [ 0.  1.  0.  1.]]
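
The example above relies on both documents having the same number of tokens. For sentences of different lengths, one possible variation (a sketch, not part of the original answer) is to encode each token as a row of its own, so that every sentence becomes a sparse matrix with one row per word and one column per vocabulary entry. The example documents and the helper name encode_sentence are illustrative assumptions.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# two example documents of different lengths (illustrative data)
docs = ["A B", "B B A"]
tokens_docs = [doc.split(" ") for doc in docs]

# put every token of the corpus into a single column so that the
# encoder learns the vocabulary as the categories of one feature
all_tokens = np.array(
    [token for tokens in tokens_docs for token in tokens], dtype=object
).reshape(-1, 1)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(all_tokens)


def encode_sentence(tokens):
    """Hypothetical helper: return a sparse (n_tokens x vocab_size) matrix."""
    column = np.array(tokens, dtype=object).reshape(-1, 1)
    return encoder.transform(column).tocsr()


sentence_matrices = [encode_sentence(tokens) for tokens in tokens_docs]

for doc, matrix in zip(docs, sentence_matrices):
    print(doc, matrix.shape)
    print(matrix.toarray())
```

Here the first document yields a matrix of shape (2, 2) and the second one of shape (3, 2). Each matrix stays sparse until toarray() is called, which matters when the vocabulary is large.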
