Support vector machine in Python using libsvm: example of features


Question

I have scraped a lot of eBay titles like this one:

Apple iPhone 5 White 16GB Dual-Core

and I have manually tagged all of them in this way:

B M C S NA

where B = Brand (Apple), M = Model (iPhone 5), C = Color (White), S = Size (16GB), NA = Not Assigned (Dual-Core).

Now I need to train an SVM classifier using the libsvm library in Python to learn the sequence patterns that occur in the eBay titles.

I need to extract new values for those attributes (Brand, Model, Color, Size) by treating the problem as a classification task. In this way I can predict new models.

I want to consider these features:

* Position
- from the beginning of the title
- to the end of the listing
* Orthographic features
- current word contains a digit
- current word is capitalized 
....

I can't understand how to give all this information to the library. The official documentation lacks a lot of information.

My classes are Brand, Model, Size, Color, NA.

What must the input file of the SVM algorithm contain?

How can I create it? Could I have an example of that file, considering the 4 features that I gave as examples in my question? Could I also have an example of the code that I must use to process the input file?

* UPDATE * I want to represent these features. How should I do it?

  1. Identity of the current word

I think that I can interpret it like this:

0 --> Brand
1 --> Model
2 --> Color
3 --> Size 
4 --> NA

If I know that the word is a Brand, I will set that variable to 1 (true). It is OK to do this in the training set (because I have tagged all the words), but how can I do it for the test set? I don't know the category of a word (this is why I'm learning it :D).

  2. N-gram substring features of the current word (N=4,5,6). No idea; what does it mean?

  3. Identity of the 2 words before the current word. How can I model this feature?

Considering the legend that I created for the 1st feature, I have 5^2 = 25 combinations:

00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44

How can I convert it to a format that libsvm (or scikit-learn) can understand?

  4. Membership in the 4 attribute dictionaries

Again, how can I do it? Having 4 dictionaries (for color, size, model and brand), I think that I must create a bool variable that I set to true if and only if the current word matches an entry in one of the 4 dictionaries.

  5. Exclusive membership in the brand dictionary

I think that, as with feature 4, I must use a bool variable. Do you agree?

Recommended answer

Here's a step-by-step guide for how to train an SVM using your data and then evaluate using the same dataset. It's also available at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f. At that URL you can also see the output of the intermediate data and the resulting accuracy (it's an IPython notebook).

You need to have the following libraries installed:

  • pandas
  • scikit-learn

From the command line:

pip install pandas
pip install scikit-learn

Step 1: Load the data

We will use pandas to load our data. pandas is a library for easily loading data. For illustration, we first save sample data to a CSV file and then load it.

We will train the SVM with train.csv and get test labels with test.csv.

import pandas as pd

train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""


with open('train.csv', 'w') as output:
    output.write(train_data_contents)

train_dataframe = pd.read_csv('train.csv')
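
The rows in train.csv have to come from somewhere. As a rough sketch (not part of the original notebook), the four example features from the question could be computed per word like this. The helper name token_features, the zero-based position counting, and the rule that "capitalized" means "first character is uppercase" are all assumptions made for illustration.

```python
import csv
import io

def token_features(tokens, index):
    """Return the four example features for tokens[index] (hypothetical helper)."""
    word = tokens[index]
    return {
        "distance_from_beginning": index,                  # zero-based, an assumption
        "distance_from_end": len(tokens) - 1 - index,
        "contains_digit": int(any(ch.isdigit() for ch in word)),
        "capitalized": int(word[:1].isupper()),
    }

# The hand-tagged title from the question. "N" stands in for NA, as in
# train.csv, because pandas reads the string "NA" as a missing value by default.
title = "Apple iPhone 5 White 16GB Dual-Core".split()
tags = ["B", "M", "M", "C", "S", "N"]

columns = ["class_label", "distance_from_beginning", "distance_from_end",
           "contains_digit", "capitalized"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
for i, tag in enumerate(tags):
    row = {"class_label": tag}
    row.update(token_features(title, i))
    writer.writerow(row)

csv_text = buffer.getvalue()   # same column layout as train.csv
print(csv_text)
```

Each word of each title becomes one row, so a corpus of titles yields one long table that can be saved and loaded with pd.read_csv exactly as above.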

Step 2: Process the data

We will convert our dataframe into numpy arrays, which is a format that scikit-learn understands.

We also need to convert the labels "B", "M", "C", ... to numbers, because the SVM does not understand strings.

Then we will train a linear SVM with the data.

import numpy as np

# Map each string label to an integer index; sorting keeps the mapping stable across runs.
train_labels = train_dataframe.class_label
labels = sorted(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])

# Every column after class_label is a feature.
train_features = np.array(train_dataframe.iloc[:, 1:])

print("train labels:")
print(train_labels)
print()
print("train features:")
print(train_features)

We see here that the length of train_labels (5) exactly matches the number of rows in train_features. Each item in train_labels corresponds to a row.
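
As an alternative to building the label list by hand, scikit-learn ships a LabelEncoder that does the same string-to-integer mapping and can reverse it later. This is a small sketch of that option, not what the original notebook does; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

raw_labels = ["B", "M", "C", "S", "N"]   # the labels used in train.csv

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)    # classes are stored in sorted order
decoded = encoder.inverse_transform(encoded)   # back to the original strings

print(encoder.classes_)
print(encoded)
```

Fitting the encoder once on the training labels and calling encoder.transform on the test labels keeps the two sets on the same numbering, and inverse_transform turns predictions back into readable tags.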

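Step 3: Train the SVM

from sklearn import svm

# SVC defaults to an RBF kernel; request a linear one to match the text above.
classifier = svm.SVC(kernel='linear')
classifier.fit(train_features, train_labels)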
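
The question also asks what a libsvm input file looks like. The libsvm text format has one line per sample: the label first, then index:value pairs for the non-zero features. scikit-learn can write and read that format, so a hedged sketch of producing it from the same five rows might look like the following. The integer labels 0 to 4 are an illustrative mapping, and the in-memory buffer stands in for a real file.

```python
import io

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Same five feature rows as train.csv; the label numbers are illustrative.
features = np.array([[1, 10, 1, 0],
                     [10, 1, 0, 1],
                     [2, 3, 0, 1],
                     [23, 2, 0, 0],
                     [12, 0, 0, 1]], dtype=float)
labels = np.array([0, 1, 2, 3, 4])

# Write in libsvm/svmlight format; zero_based=False numbers features from 1,
# which is the convention libsvm's own tools expect.
buffer = io.BytesIO()
dump_svmlight_file(features, labels, buffer, zero_based=False)
libsvm_text = buffer.getvalue().decode("ascii")
print(libsvm_text)

# Read it back to check that nothing was lost.
buffer.seek(0)
loaded_features, loaded_labels = load_svmlight_file(
    buffer, n_features=4, zero_based=False)
```

Passing a file path instead of the buffer writes a file that the libsvm command-line tools or its Python bindings should be able to consume.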

Step 4: Evaluate the SVM on some testing data

test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""

with open('test.csv', 'w') as output:
    output.write(test_data_contents)

test_dataframe = pd.read_csv('test.csv')

# Reuse the label-to-index mapping built from the training data so that
# the same class gets the same number in both sets.
test_labels = test_dataframe.class_label
test_labels = np.array([labels.index(x) for x in test_labels])

test_features = np.array(test_dataframe.iloc[:, 1:])

results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
accuracy = float(num_correct) / len(test_labels)
print("model accuracy: {:.1f}%".format(accuracy * 100))
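
The update in the question asks how to encode the identity of the current word, the two words before it, and N-gram substrings. These are usually read as features of the word text itself (not of its class label), so they are available for test data too. One common way to turn such categorical features into numbers is one-hot encoding; the sketch below uses scikit-learn's DictVectorizer for that. It goes beyond the original notebook, and the helper name context_features, the "<start>" padding token, and the lowercasing are assumptions. Only N=4 is shown; N=5 and N=6 would follow the same pattern.

```python
from sklearn.feature_extraction import DictVectorizer

def context_features(tokens, index):
    """Describe tokens[index] by its text and the two words before it (hypothetical helper)."""
    word = tokens[index]
    features = {
        "word": word.lower(),
        "prev1": tokens[index - 1].lower() if index >= 1 else "<start>",
        "prev2": tokens[index - 2].lower() if index >= 2 else "<start>",
        "contains_digit": int(any(ch.isdigit() for ch in word)),
    }
    # Character 4-grams of the current word, each as a binary indicator.
    for start in range(len(word) - 3):
        features["4gram=" + word[start:start + 4].lower()] = 1
    return features

title = "Apple iPhone 5 White 16GB Dual-Core".split()
rows = [context_features(title, i) for i in range(len(title))]

# String-valued entries become one column per (name, value) pair;
# numeric entries are kept as single columns.
vectorizer = DictVectorizer()
matrix = vectorizer.fit_transform(rows)   # sparse matrix, one row per word
print(matrix.shape)
```

The resulting matrix can be passed to classifier.fit in place of train_features. For test data, call vectorizer.transform (not fit_transform) so the columns line up; feature names never seen during fitting are silently ignored. Dictionary-membership features (items 4 and 5) fit the same scheme as extra 0/1 entries in the dict.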

Links & Tips

  • Example code for how to load LinearSVC: http://scikit-learn.org/stable/modules/svm.html#svm
  • Long list of scikit-learn examples: http://scikit-learn.org/stable/auto_examples/index.html. I've found these mildly helpful but often confusing myself.
  • If you find that the SVM is taking a long time to train, try LinearSVC instead: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
  • Here's another tutorial on getting familiar with machine learning models: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
You should be able to take this code and replace train.csv with your training data, test.csv with your testing data, and get predictions for your test data, along with accuracy results.

Note that since you're evaluating on the same data you trained on, the accuracy will be unusually high.
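
To get a more honest accuracy figure, the usual approach is to hold out part of the labelled data and score only on that part. The following is a minimal sketch of that idea using train_test_split and LinearSVC; the feature rows are made up purely for illustration and the two-class setup is a simplification.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Made-up rows (distance_from_beginning, distance_from_end,
# contains_digit, capitalized), for illustration only.
features = np.array([
    [0, 5, 0, 1], [0, 4, 0, 1], [0, 6, 0, 1], [0, 3, 0, 1],   # class 0
    [4, 1, 1, 0], [3, 1, 1, 0], [5, 1, 1, 0], [4, 2, 1, 0],   # class 1
])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Keep 25% of the rows aside; stratify so both classes appear in the test split.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0)

classifier = LinearSVC()
classifier.fit(X_train, y_train)

# score() reports accuracy on rows the classifier never saw during training.
accuracy = classifier.score(X_test, y_test)
print("held-out accuracy: {:.1f}%".format(accuracy * 100))
```

With real data the same split can be applied to the arrays built from train.csv, as long as each class has enough rows to appear on both sides of the split.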
