How to include words as numerical features in classification

Question

What is the best way to use words themselves as features in a machine learning algorithm?

My problem is that I have to extract word-related features from a particular paragraph. Should I use each word's index in the dictionary as the numerical feature? If so, how would I normalize these indices?

In general, how are words themselves used as features in NLP?

Recommended answer

There are several conventional techniques by which words are mapped to features (columns in a 2D data matrix whose rows are the individual data vectors) for input to machine learning models:

  • a Boolean field which encodes the presence or absence of that word in a given document;

  • a frequency histogram of a predetermined set of words, often the X most commonly occurring words among all documents in the training data (more about this one in the last paragraph of this answer);

  • the juxtaposition of two or more words (e.g., 'alternative' and 'lifestyle' in consecutive order have a meaning not related to either component word); this juxtaposition can either be captured in the data model itself, e.g., a Boolean feature that represents the presence or absence of two particular words directly adjacent to one another in a document, or this relationship can be exploited in the ML technique, as a naive Bayesian classifier would do in this instance;

  • words used as raw data from which latent features are extracted, e.g., LSA or Latent Semantic Analysis (also sometimes called LSI, for Latent Semantic Indexing). LSA is a matrix-decomposition-based technique that derives latent variables which are not apparent from the words of the text themselves.
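The first three techniques in the list can be sketched in a few lines of plain Python. This is only a minimal illustration, not a definitive implementation: the helper names (`tokenize`, `build_vocabulary`, `presence_features`, `count_features`, `bigram_feature`), the regex tokenizer, and the two sample sentences are all assumptions made for the example and do not come from the answer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(docs, max_words):
    """Return the max_words most frequent words across all docs, in a fixed order."""
    counts = Counter(word for doc in docs for word in tokenize(doc))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_words]]


def presence_features(doc, vocabulary):
    """Boolean column per vocabulary word: 1 if the word occurs in the document."""
    present = set(tokenize(doc))
    return [1 if word in present else 0 for word in vocabulary]


def count_features(doc, vocabulary):
    """Frequency-histogram column per vocabulary word: number of occurrences."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


def bigram_feature(doc, first, second):
    """Boolean feature: 1 if `first` is immediately followed by `second`."""
    tokens = tokenize(doc)
    return 1 if (first, second) in zip(tokens, tokens[1:]) else 0


# Hypothetical sample documents, one row of the data matrix each.
docs = [
    "The alternative lifestyle is the topic of the book",
    "An alternative route changed the lifestyle of the town",
]
vocabulary = build_vocabulary(docs, max_words=5)
presence_matrix = [presence_features(d, vocabulary) for d in docs]
count_matrix = [count_features(d, vocabulary) for d in docs]
bigram_column = [bigram_feature(d, "alternative", "lifestyle") for d in docs]
```

Each inner list is one row of the data matrix, and each vocabulary word (or word pair) is one column. Note that both sample documents contain 'alternative' and 'lifestyle', but only the first contains them adjacent to each other, which is exactly what the bigram column captures and the single-word columns miss.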

A common reference data set in machine learning consists of the frequencies of 50 or so of the most common words, also known as "stop words" (e.g., a, an, of, and, the, there, if), in the published works of Shakespeare, London, Austen, and Milton. A basic multi-layer perceptron with a single hidden layer can separate this data set with 100% accuracy. This data set and variations on it are widely available in ML data repositories, and academic papers presenting classification results on it are likewise common.
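As a rough sketch of how such stop-word frequency features might be computed for one text, the snippet below divides each stop word's count by the total number of tokens, so that texts of different lengths are comparable. The function name `stop_word_frequencies`, the regex tokenizer, the length normalization, and the sample sentence are assumptions for illustration; the stop-word list contains only the seven examples named in the answer, not the full set of 50 or so.

```python
import re
from collections import Counter

# Only the example stop words named in the answer; a real feature set would be larger.
STOP_WORDS = ["a", "an", "of", "and", "the", "there", "if"]


def stop_word_frequencies(text, stop_words=STOP_WORDS):
    """Return one relative frequency (count / total tokens) per stop word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        # Avoid division by zero for empty input.
        return [0.0] * len(stop_words)
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in stop_words]


# Hypothetical sample text standing in for a passage from one author.
sample = "If there is a will, there is a way, and the way is the goal."
features = stop_word_frequencies(sample)
```

Each text (for instance, a chapter or a fixed-size chunk of a work) would yield one such feature vector, labeled with its author, and the resulting rows form the input matrix for a classifier.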
