How to transform a text to a vector

Question

I'm learning classification. I read about using vectors. But I can't find an algorithm to translate a text with words to a vector. Is it about generating a hash of the words and adding a 1 to the hash location in the vector?

Recommended answer

When most people talk about turning text into a feature vector, all they mean is recording the presence of the word (token).

There are two main ways to encode a vector. One is explicit, where you have a 0 for each word that is not present (but is in your vocabulary). The other is implicit, like a sparse matrix (but just a single vector), where you only encode terms with a frequency value >= 1.

The article that explains this best is most likely the one on the bag-of-words model, which is used extensively in natural language processing applications.

Say you have the vocabulary:

{brown, dog, fox, jumped, lazy, over, quick, the, zebra}

The sentence "the quick brown fox jumped over the lazy dog" could be encoded as:

<1, 1, 1, 1, 1, 1, 1, 2, 0>

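Remember, position is important: each slot in the vector corresponds to the vocabulary term at the same position.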

The sentence "the zebra jumped", even though it is shorter, would then be encoded as:

<0, 0, 0, 1, 0, 0, 0, 1, 1>
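The explicit encoding above can be sketched in a few lines of Python. This is only an illustration, not a definitive implementation: the function name `encode_explicit` and the naive lowercase, whitespace-based tokenizer are assumptions made for the example, and words outside the vocabulary are simply ignored.

```python
from collections import Counter

# Vocabulary from the example above, kept in a fixed (sorted) order.
VOCABULARY = ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the", "zebra"]

def encode_explicit(sentence, vocabulary=VOCABULARY):
    """Return one count per vocabulary term, in vocabulary order.

    Hypothetical helper: tokenization is a plain lowercase split on
    whitespace, which is an assumption made for this sketch.
    """
    counts = Counter(sentence.lower().split())
    # Counter returns 0 for terms that never occurred in the sentence.
    return [counts[term] for term in vocabulary]

print(encode_explicit("the quick brown fox jumped over the lazy dog"))
# [1, 1, 1, 1, 1, 1, 1, 2, 0]
print(encode_explicit("the zebra jumped"))
# [0, 0, 0, 1, 0, 0, 0, 1, 1]
```

Every sentence maps to a list of the same length as the vocabulary, which is what makes the position of each value meaningful.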

The problem with the explicit approach is that if you have hundreds of thousands of vocabulary terms, each document will also have hundreds of thousands of terms (with mostly zero values).

With the implicit approach, which stores only the terms that actually occur, the sentence "the zebra jumped" could be encoded as:

<'jumped': 1, 'the': 1, 'zebra': 1>

Here the order is arbitrary.
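A minimal Python sketch of the implicit (sparse) encoding follows. It assumes a dictionary from term to count as the sparse representation. The function name `encode_implicit`, the optional `vocabulary` filter and the whitespace tokenizer are hypothetical choices for illustration only.

```python
from collections import Counter

def encode_implicit(sentence, vocabulary=None):
    """Return a {term: count} mapping holding only terms with count >= 1.

    Hypothetical helper: tokenization is a plain lowercase split on
    whitespace. If a vocabulary is given, terms outside it are dropped.
    """
    counts = Counter(sentence.lower().split())
    return {
        term: count
        for term, count in counts.items()
        if vocabulary is None or term in vocabulary
    }

encoded = encode_implicit("the zebra jumped")
print(encoded == {"jumped": 1, "the": 1, "zebra": 1})  # True: key order does not matter
```

The size of the result depends on the number of distinct words in the sentence, not on the size of the vocabulary, so no zeros are stored.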
