How to use vector representation of words (as obtained from Word2Vec, etc.) as features for a classifier?


Question


I am familiar with using BOW features for text classification, wherein we first find the size of the vocabulary for the corpus, which becomes the size of our feature vector. For each sentence/document, and for each of its constituent words, we then set the corresponding feature to 0/1 depending on the absence/presence of that word in that sentence/document.
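The binary BOW scheme described above can be sketched as follows (a minimal illustration with a toy two-document corpus; the function and variable names are chosen here for illustration):

```python
# Minimal binary bag-of-words sketch (no external libraries).

def build_vocab(docs):
    """Map each unique word in the corpus to a feature index."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def bow_vector(doc, vocab):
    """Binary feature vector: 1 if the word occurs in the doc, else 0."""
    vec = [0] * len(vocab)
    for word in doc.split():
        if word in vocab:
            vec[vocab[word]] = 1
    return vec

docs = ["the cat sat", "the dog ran"]
vocab = build_vocab(docs)                 # |V| = 5 for this toy corpus
features = bow_vector("the cat ran", vocab)
print(features)
```

Note that the feature vector's length is fixed by the global vocabulary size |V|, which is exactly the assumption the question below asks about.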


However, now that I am trying to use a vector representation of each word, is creating a global vocabulary still essential?

Answer


Suppose the size of the vectors is N (usually between 50 and 500). The naive way of generalizing the traditional BOW is to replace each 0 bit (in BOW) with N zeros, and each 1 bit (in BOW) with the real vector (say from Word2Vec). Then the size of the features would be N * |V| (compared to |V| features in BOW, where |V| is the size of the vocabulary). This simple generalization should work fine given a decent number of training instances.
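The naive generalization can be sketched like this (a toy example: the embedding values below are made-up stand-ins for real Word2Vec output, and the names are chosen for illustration):

```python
# Sketch of the naive generalization: each BOW bit becomes an N-dim slot,
# filled with the word's embedding if the word occurs, zeros otherwise.

N = 3  # embedding dimension (real Word2Vec vectors are typically 50-500 dims)

# Toy embeddings standing in for vectors learned by Word2Vec.
embeddings = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
    "ran": [0.7, 0.8, 0.9],
}
vocab = {"the": 0, "cat": 1, "ran": 2}  # |V| = 3

def concat_features(doc, vocab, embeddings, n):
    """Feature vector of size N * |V|: the word's vector in its slot, zeros elsewhere."""
    vec = [0.0] * (n * len(vocab))
    for word in doc.split():
        if word in vocab:
            start = vocab[word] * n
            vec[start:start + n] = embeddings[word]
    return vec

features = concat_features("the cat", vocab, embeddings, N)
print(len(features))  # N * |V|
```

The resulting vector is N times longer than the plain BOW vector, which is why the answer goes on to discuss ways of making the features smaller.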


To make the feature vectors smaller, people use various techniques, such as combining the word vectors recursively with various operations (see Recursive/Recurrent Neural Networks and similar tricks, for example: http://web.engr.illinois.edu/~khashab2/files/2013_RNN.pdf or http://papers.nips.cc/paper/4204-dynamic-pooling-and-unfolding-recursive-autoencoders-for-paraphrase-detection.pdf )
