scikit-learn sample: trying it out with my classifier and data


Problem description

I have built a small program that creates a classifier for a given dataset with scikit-learn. Now I want to try an example to see the classifier at work. For example, the clf has to detect "cats".

This is how I proceed:

I have 50 pictures of cats and 50 pictures of non-cats.



  1. get descriptors for the data_set with a SIFT feature detector
  2. split the data into a training set and a test set (25 cat pictures + 25 non-cat pictures = training_set; the test_set is built the same way)
  3. get cluster centers with k-means from the training_set
  4. create histogram data for the training_set and test_set by using the cluster centers
  5. try this code from scikit-learn:

from sklearn.grid_search import GridSearchCV   # sklearn.model_selection on newer scikit-learn versions
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
  print("# Tuning hyper-parameters for %s" % score)
  print()

  clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
  clf.fit(X_train, y_train)

  print("Best parameters set found on development set:")
  print()
  print(clf.best_estimator_)
  print()
  print("Grid scores on development set:")
  print()
  for params, mean_score, scores in clf.grid_scores_:
     print("%0.3f (+/-%0.03f) for %r"
          % (mean_score, scores.std() / 2, params))
  print()
  print("Detailed classification report:")
  print()
  print("The model is trained on the full development set.")
  print("The scores are computed on the full evaluation set.")
  print()
  y_true, y_pred = y_test, clf.predict(X_test)
  print y_true
  print y_pred
  print(classification_report(y_true, y_pred))
  print()
  print clf.score(X_train, y_train)
  print "score"
  print clf.best_params_
  print "best_params"
  pred = clf.predict(X_test)
  print accuracy_score(y_test, pred)
  print "accuracy_score"


The result I get is:

# Tuning hyper-parameters for recall
()
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 0.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 0.]. 
  average=average)
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 1.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 1.]. 
  average=average)
Best parameters set found on development set:
()
SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel=rbf, max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
()
Grid scores on development set:
()
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.0001}
()
Detailed classification report:
()
The model is trained on the full development set.
The scores are computed on the full evaluation set.
()
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  1.  1.  1.  1.]
             precision    recall  f1-score   support

        0.0       1.00      0.04      0.08        25
        1.0       0.51      1.00      0.68        25

avg / total       0.76      0.52      0.38        50

()
0.52
score
{'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
best_params
0.52
accuracy_score

It seems that the clf says "it's a cat" to everything... but why?

Is the data_set too small to get a good result?
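
One quick sanity check (not part of the original code; just a sketch using scikit-learn's DummyClassifier) is to compare against a majority-class baseline - on this balanced 25/25 test set it scores about 0.5, so the SVM's 0.52 is barely above chance:

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')  # always predicts the most common training class
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))                 # ~0.5 on this balanced test set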

Edit: I'm using VLFeat to detect the SIFT descriptors.

The functions:

import numpy
from sklearn.cluster import KMeans
# vlfeat_module: the custom VLFeat wrapper for SIFT mentioned above (import not shown in the post)

def create_descriptor_data(data, ID):
    descriptor_list = []
    datas = numpy.genfromtxt(data, dtype='str')
    for p in datas:
      locs, desc = vlfeat_module.vlf_create_descriptors(p, str(ID)+'.key', ID) # create descriptors and save descs in file
      if len(desc) > 500:
        desc = desc[::len(desc) // 400] # integer step; keeps roughly 400 - 800 descriptors
      descriptor_list.append(desc)
      ID += 1 # ID for the filename
    return descriptor_list

# create k-means centers from all *.txt files in directory (data)
def create_center_data(data):
    #data = numpy.vstack(data)
    n_clusters = len(numpy.unique(data))
    kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1)
    kmeans.fit(data)
    return kmeans, n_clusters

def create_histogram_data(kmeans, descs, n_clusters):
    histogram_list = []
    # build one histogram per picture's descriptors
    for desc in descs:
      length = len(desc)
      # assign each descriptor to its nearest cluster center
      histogram = kmeans.predict(desc)
      histogram = numpy.bincount(histogram, minlength=n_clusters) # minlength = k in k-means
      histogram = numpy.divide(histogram, length, dtype='float')  # normalize by the number of descriptors
      histogram_list.append(histogram)
    histogram = numpy.vstack(histogram_list)
    return histogram

And the calls:

import numpy
import lib.dataset_module
from sklearn.cross_validation import train_test_split   # sklearn.model_selection on newer scikit-learn versions

X_desc_pos = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_pos.txt",0) # create desc from dataset_pos, 25 pics
X_desc_neg = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_neg.txt",51) # create desc from dataset_neg, 25 pics

X_train_pos, X_test_pos = train_test_split(X_desc_pos, test_size=0.5)
X_train_neg, X_test_neg = train_test_split(X_desc_neg, test_size=0.5)

x1 = numpy.vstack(X_train_pos)
x2 = numpy.vstack(X_train_neg)
kmeans, n_clusters = lib.dataset_module.create_center_data(numpy.vstack((x1,x2)))

X_train_pos = lib.dataset_module.create_histogram_data(kmeans, X_train_pos, n_clusters)
X_train_neg = lib.dataset_module.create_histogram_data(kmeans, X_train_neg, n_clusters)

X_train = numpy.vstack([X_train_pos, X_train_neg])
y_train = numpy.hstack([numpy.ones(len(X_train_pos)), numpy.zeros(len(X_train_neg))])

X_test_pos = lib.dataset_module.create_histogram_data(kmeans, X_test_pos, n_clusters)
X_test_neg = lib.dataset_module.create_histogram_data(kmeans, X_test_neg, n_clusters)

X_test = numpy.vstack([X_test_pos, X_test_neg])
y_test = numpy.hstack([numpy.ones(len(X_test_pos)), numpy.zeros(len(X_test_neg))])

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_estimator_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in clf.grid_scores_:
       print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() / 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print y_true
    print y_pred
    print(classification_report(y_true, y_pred))
    print()
    print clf.score(X_train, y_train)
    print "score"
    print clf.best_params_
    print "best_params"
    pred = clf.predict(X_test)
    print accuracy_score(y_test, pred)
    print "accuracy_score"

Edit: some changes after updating the parameter ranges and scoring with "accuracy" again:

# Tuning hyper-parameters for accuracy
()
Best parameters set found on development set:
()
SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=1.0, kernel=rbf, max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
()
Grid scores on development set:
()
...
()
Detailed classification report:
()
The model is trained on the full development set.
The scores are computed on the full evaluation set.
()
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0.  1.  0.  1.  1.  1.
  1.  1.  1.  0.  1.  1.  1.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
             precision    recall  f1-score   support

        0.0       0.88      0.92      0.90        25
        1.0       0.92      0.88      0.90        25

avg / total       0.90      0.90      0.90        50

()
1.0
score
{'kernel': 'rbf', 'C': 1000.0, 'gamma': 1.0}
best_params
0.9
accuracy_score

But when testing it on a picture with

rslt = clf.predict(test_histogram)

it still says to a sofa: "you're a cat" :D
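
For reference, the post does not show how test_histogram is built; a sketch using the question's own helpers (the image path, key-file name and ID here are hypothetical) could look like:

# build the histogram for one new picture and classify it
locs, desc = vlfeat_module.vlf_create_descriptors("./static/picture_set/sofa.jpg", "200.key", 200)
test_histogram = lib.dataset_module.create_histogram_data(kmeans, [desc], n_clusters)
rslt = clf.predict(test_histogram)
print(rslt)  # 1.0 = cat, 0.0 = non-cat in this label encoding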

Recommended answer

There are many possible causes of such behaviour:



  • There is an error in the creation of the training/testing data [implementation error]
  • A training set of 20 elements (25 vectors with 5-fold cross-validation leaves 20 for training) may be too small for good generalization [underfitting]
  • The range of checked C and gamma parameters may be too narrow - these variables are highly data dependent; your representation's values may require completely different C's and gamma's than those currently used [under/overfitting]

My personal guess (as it is hard to reproduce the issue without the data) is the third option - bad C and gamma ranges for finding a good model.

EDIT

You should try much bigger ranges of values, e.g.



  • C between 10^-5 and 10^15
  • gamma between 10^-14 and 10^2

C = [10.0 ** (i - 5) for i in range(21)]       # 1e-5, 1e-4, ..., 1e15
gamma = [10.0 ** (i - 14) for i in range(17)]  # 1e-14, 1e-13, ..., 1e2
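
These lists can then be dropped straight into the question's parameter grid (a sketch reusing the same GridSearchCV setup and the X_train/y_train from the question):

tuned_parameters = [{'kernel': ['rbf'], 'gamma': gamma, 'C': C},
                    {'kernel': ['linear'], 'C': C}]

clf = GridSearchCV(SVC(), tuned_parameters, cv=5, scoring='accuracy')
clf.fit(X_train, y_train)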


EDIT2

Once the parameter ranges are corrected, you should perform an actual case study. Gather more images, analyze your data representation (is a histogram really enough for this task?), process your data (is it already normalized? maybe try some decorrelation?), and consider using simpler kernels - rbf can be very deceptive: on one hand it can get great scores during training, but on the other it can fail completely during testing. This is a result of its capacity to overfit (for any consistent data set an RBF-SVM can achieve a 100% score during training), so finding a balance between the model's power and its generalization ability is a hard problem. This is where the actual "machine learning journey" begins - have fun!
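
As one concrete illustration of those suggestions (a sketch under the assumption that the X_train/X_test histogram arrays and the C/gamma lists above are available, not a tested recipe): scale the histogram features and let the grid search compare a plain linear kernel against rbf, with the scaler refit inside each cross-validation fold via a Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV   # sklearn.model_selection on newer versions

pipe = Pipeline([('scale', StandardScaler()),  # zero mean / unit variance per histogram bin
                 ('svm', SVC())])

param_grid = [{'svm__kernel': ['linear'], 'svm__C': C},
              {'svm__kernel': ['rbf'], 'svm__C': C, 'svm__gamma': gamma}]

clf = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
clf.fit(X_train, y_train)
print(clf.best_params_)
print(clf.score(X_test, y_test))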
