How do I use non-integer string labels with SVM from scikit-learn? Python

Question

Scikit-learn has fairly user-friendly Python modules for machine learning.

I am trying to train an SVM tagger for Natural Language Processing (NLP) where my labels and input data are words and annotations. For example, in part-of-speech tagging, rather than using double/integer data as input tuples such as [[1,2], [2,0]], my tuples look like this: [['word','NOUN'], ['young', 'adjective']]

Can anyone give an example of how I can use the SVM with string tuples? The tutorial/documentation given here is for integer/double inputs: http://scikit-learn.org/stable/modules/svm.html

Recommended answer

Most machine learning algorithms process input samples that are vectors of floats, such that a small (often Euclidean) distance between a pair of samples means that the two samples are similar in a way that is relevant to the problem at hand.

It is the responsibility of the machine learning practitioner to find a good encoding of the raw data as a set of float features. This encoding is domain specific, hence there is no general way to build that representation out of the raw data that would work across all application domains (various NLP tasks, computer vision, transaction log analysis...). This part of the machine learning modeling work is called feature extraction. When it involves a lot of manual work, it is often referred to as feature engineering.
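To make the idea of encoding raw strings as float vectors concrete, here is a minimal hand-written sketch of one-hot encoding. The helper `one_hot` and the tag list are hypothetical and chosen only for illustration; they are not part of scikit-learn.

```python
import math

def one_hot(value, vocabulary):
    """Encode a string as a float vector with a single 1.0 at its vocabulary position."""
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

# Hypothetical vocabulary of part-of-speech tags (an assumption for this sketch).
tags = ["NOUN", "ADJ", "VERB"]

noun = one_hot("NOUN", tags)
adj = one_hot("ADJ", tags)

# Identical tags are at distance zero; different tags are all equally far apart.
d_same = math.dist(noun, one_hot("NOUN", tags))
d_diff = math.dist(noun, adj)
print(noun, adj, d_same, d_diff)
```

With this encoding no tag is artificially "closer" to another, which is usually what you want for unordered categories; mapping the tags to the integers 0, 1, 2 instead would imply an ordering that does not exist.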

Now for your specific problem: the POS tags of a window of words around a word of interest in a sentence (e.g. for sequence tagging such as named entity detection) can be encoded appropriately using the DictVectorizer feature extraction helper class of scikit-learn.
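A rough sketch of how this could look is given below. The feature names (`word`, `prev_word`, `next_word`), the toy sentence and the choice of `LinearSVC` are assumptions made for illustration, not something prescribed by the answer. Note that scikit-learn classifiers accept string class labels directly, so only the input features need to be vectorized.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data: one dict of string features per word,
# describing the word itself and its immediate neighbours.
train_features = [
    {"word": "the", "prev_word": "<s>", "next_word": "young"},
    {"word": "young", "prev_word": "the", "next_word": "dog"},
    {"word": "dog", "prev_word": "young", "next_word": "barks"},
    {"word": "barks", "prev_word": "dog", "next_word": "</s>"},
]
# String labels can be passed to the classifier as they are.
train_tags = ["DET", "ADJ", "NOUN", "VERB"]

# DictVectorizer one-hot encodes every (feature name, string value) pair
# into a column of a sparse float matrix.
vectorizer = DictVectorizer(sparse=True)
X_train = vectorizer.fit_transform(train_features)

clf = LinearSVC()
clf.fit(X_train, train_tags)

# New samples must go through the same fitted vectorizer;
# feature values not seen during fitting are silently ignored.
new_sample = {"word": "dog", "prev_word": "the", "next_word": "barks"}
X_new = vectorizer.transform([new_sample])
predicted = clf.predict(X_new)
print(predicted)
```

Four training samples are far too few to learn anything useful; the point is only to show how dictionaries of strings move through `DictVectorizer` into an SVM.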
