用字符串数组作为输入向量进行分类 [英] Classification with array of strings as input vector

查看:192
本文介绍了用字符串数组作为输入向量进行分类的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个与机器学习任务有关的问题。问题是基于字符串的矢量预测值。想到的最直接的想法是使用线性回归。然而,由于我的输入是非数字的,我认为我会使用我的字符串的哈希码,但是我在这里读过某处,结果将是没有意义的。另一个想法是使用字母表中的字母位置对我的字符串进行编码,但是我还没有对它进行测试,因此需要咨询。

有人可以推荐一种好的(有意义的)编码字符串的方式,以便它们可以用于线性回归算法?或者建议另一种适用于该任务的机器学习算法。总结:分类器的输入将由一个固定大小的字符串数组组成(数组是固定长度的,不是字符串),并且输出应该是0-100范围内的整数。训练数据将由这样的输入数组(x值)和相应的数字(y值)组成。 >使用向量空间模型(如)将每个 M 字符串转换为 N https://code.google.com/p/word2vec/ =nofollow> word2vec 手套。然后将这些向量连接到一个具有 M * N 组件的向量。任选地将每个组分标准化为例如0-1。您应该能够对结果运行任何回归(或分类)算法,例如,逻辑回归。

您也可以尝试使用聚类方法,将词汇表中的所有单词聚类为 N 集群,例如在单词向量上使用k-means或使用棕色聚类。然后,您可以使用一个热矢量(即 N-1 零)并且在该单词集群的索引处使用一个单词来表示输入数组中的每个单词。然后再次连接它们并对结果运行回归。


I have a question related to the machine learning task. The problem is to predict a value based on the vector of strings. The most straightforward idea that came to mind was to use linear regression. However, since my input is non-numeric, I thought I'd use hashcode of my strings, but I've read somewhere here that the results will be meaningless. Another idea was to encode my strings in base 26 using the letter positions in the alphabet, but I haven't tested it yet, thus asking for advice.

Could someone recommend a good (meaningful) way of encoding strings so that they can be used in linear regression algorithm? Or suggest another machine learning algorithm suitable for the task.

To summarise: the input to the classifier will consist of a fixed size array of strings (arrays are fixed length, not strings), and the output should be an integer in range 0-100. The training data will consist of a collection of such input arrays (x-values) with corresponding numbers (y-values).

解决方案

Transform each one of your M strings into an N-dimensional vector using a vector space model like word2vec or GloVe. Then concatenate these vectors to one vector with M*N components. Optionally normalize each component to e.g. 0-1. You should then be able to run any regression (or classification) algorithm on the result, e.g. logistic regression.

You might also try a clustering approach, where you cluster all the words in your vocabulary into N clusters, e.g. with k-means on the word vectors or using brown clustering. You could then represent each word in your input array with a one hot vector (i.e. N-1 zeros and a single one at the index of the cluster of that word). Then concatenate them again and run regression on the result.

这篇关于用字符串数组作为输入向量进行分类的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆