How can I classify data with the nearest-neighbor algorithm using Python?

Question

I need to classify some data with (I hope) a nearest-neighbour algorithm. I've googled this problem and found a lot of libraries (including PyML, mlpy and Orange), but I'm unsure of where to start.

How should I go about implementing k-NN using Python?

Solution

Particularly given the technique (k-nearest neighbors) that you mentioned in your question, I would strongly recommend scikits.learn. [Note: after this answer was posted, the lead developer of this project informed me of a new homepage for the project.]

A few features that I believe distinguish this library from the others (at least the other Python ML libraries that I have used, which is most of them):

  • an extensive diagnostics & testing library (including plotting modules, via Matplotlib), with feature-selection algorithms, confusion matrix, ROC, precision-recall, etc.;

  • a nice selection of 'batteries-included' data sets (including handwritten digits, facial images, etc.) particularly suited for ML techniques;

  • extensive documentation (a nice surprise given that this Project is only about two years old) including tutorials and step-by-step example code (which use the supplied data sets);
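
To make the first two bullets concrete, here is a minimal sketch that pairs one of the bundled data sets with one of the diagnostic helpers. It assumes a current scikit-learn release, where the package is imported as `sklearn` and the classifier is named `KNeighborsClassifier`. The split size and `random_state` are arbitrary choices for illustration, not values from the answer.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled handwritten-digits data set (8x8 images, 10 classes).
digits = datasets.load_digits()

# Hold out a quarter of the samples; random_state is fixed only for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Fit a 5-nearest-neighbor classifier and predict the held-out samples.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Off-diagonal entries show which digits get confused with each other, which is usually more informative than a single accuracy number.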

Without exception (at least that I can think of at this moment), the Python ML libraries are superb. (See the PyMVPA homepage for a list of the dozen or so most popular Python ML libraries.)

In the past 12 months, for instance, I have used ffnet (for MLP), neurolab (also for MLP), PyBrain (Q-Learning), and PyMVPA (SVM), all available from the Python Package Index. These vary significantly from each other with respect to maturity, scope, and supplied infrastructure, but I found them all to be of very high quality.

Still, the best of these might be scikits.learn; for instance, I am not aware of any Python ML library other than scikits.learn that combines all three of the features I mentioned above (a few have solid example code and/or tutorials, but none that I know of integrate these with a library of research-grade data sets and diagnostic algorithms).

Second, given the technique you intend to use (k-nearest neighbors), scikits.learn is a particularly good choice. Scikits.learn includes kNN algorithms for both regression (returns a score) and classification (returns a class label), as well as detailed sample code for each.
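
The difference between the two modes can be sketched on a toy problem. This assumes the class names used by current scikit-learn releases (`KNeighborsClassifier` and `KNeighborsRegressor`); the data points are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy one-dimensional training data (made up for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
labels = np.array([0, 0, 1, 1])          # class labels for classification
values = np.array([0.0, 1.0, 2.0, 3.0])  # continuous targets for regression

clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, values)

# The three nearest training points to 2.6 are 3.0, 2.0 and 1.0.
query = np.array([[2.6]])
predicted_label = clf.predict(query)[0]  # majority vote among the 3 nearest
predicted_value = reg.predict(query)[0]  # mean target of the 3 nearest
print(predicted_label, predicted_value)
```

The classifier takes a majority vote over the neighbors' labels, while the regressor (with its default uniform weights) averages the neighbors' target values.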

Using the scikits.learn k-nearest neighbor module (literally) couldn't be any easier:

>>> # import NumPy and the relevant scikit-learn module (the package is imported as sklearn)
>>> import numpy as np
>>> from sklearn import neighbors as kNN

>>> # load one of the sklearn-supplied data sets
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> # the call to load_iris() loaded both the data and the class labels, so
>>> # bind each to its own variable
>>> data = iris.data
>>> class_labels = iris.target

>>> # construct a classifier-builder by instantiating the kNN module's primary class
>>> # (named NeighborsClassifier in early releases, KNeighborsClassifier in current ones)
>>> kNN1 = kNN.KNeighborsClassifier(n_neighbors=5)

>>> # now construct ('train') the classifier by passing the data and class labels
>>> # to the classifier-builder; fit() returns the fitted estimator, and the
>>> # repr echoed by the interpreter varies between scikit-learn versions
>>> kNN1.fit(data, class_labels)
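
Once a classifier has been fitted, the usual next steps are to check it on data it has not seen and to classify new points. The sketch below assumes a current scikit-learn release (`KNeighborsClassifier`, `train_test_split`); the 70/30 split and the sample measurements are illustrative choices, not part of the answer.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Keep 30% of the flowers aside for evaluation, preserving class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Fraction of held-out flowers whose species is predicted correctly.
accuracy = clf.score(X_test, y_test)

# Classify one new flower: sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted = iris.target_names[clf.predict(new_flower)[0]]
print(accuracy, predicted)
```

Scoring on held-out data matters for k-NN in particular: a point scored against the training set can count itself among its own neighbors, which inflates the accuracy.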

What's more, unlike nearly all other ML techniques, the crux of k-nearest neighbors is not coding a working classifier builder. The difficult step in building a production-grade k-nearest neighbor classifier/regressor is the persistence layer, i.e., storage and fast retrieval of the data points from which the nearest neighbors are selected. For the kNN data storage layer, scikits.learn includes an algorithm for a ball tree (which I know almost nothing about, other than that it is apparently superior to the kd-tree, the traditional data structure for k-NN, because its performance doesn't degrade in higher-dimensional feature spaces).
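
As a rough illustration of that storage layer, current scikit-learn releases expose both index structures directly as `sklearn.neighbors.BallTree` and `sklearn.neighbors.KDTree`. The sketch below builds each over synthetic random points (an assumption made for the example) and runs the same query against both. It makes no claim about which is faster; that depends on the data and its dimensionality.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

# 1000 synthetic points in 10 dimensions (illustrative data only).
rng = np.random.default_rng(0)
points = rng.random((1000, 10))

# Build each index once; later queries avoid scanning every stored point.
ball = BallTree(points, leaf_size=20)
kd = KDTree(points, leaf_size=20)

# Ask both indexes for the 3 nearest stored points to the first point.
query = points[:1]
ball_dist, ball_idx = ball.query(query, k=3)
kd_dist, kd_idx = kd.query(query, k=3)
print(ball_idx, ball_dist)
```

Both trees are exact, so they return the same neighbors and differ only in how quickly they find them. The classifier can be told which structure to use through its `algorithm` parameter (`'ball_tree'`, `'kd_tree'`, `'brute'`, or `'auto'`).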

Additionally, k-nearest neighbors requires an appropriate similarity metric (Euclidean distance is the usual choice, though not always the best one). Scikits.learn includes a standalone module comprising various distance metrics, as well as testing algorithms for selecting the appropriate one.
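
One way to compare candidate metrics is cross-validation. This is a swapped-in approach for illustration, not necessarily the selection tooling the answer has in mind. The sketch assumes a current scikit-learn release, where `KNeighborsClassifier` accepts a `metric` argument; the three metric names are standard options and the 5-fold setting is arbitrary.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Mean 5-fold cross-validated accuracy for each candidate distance metric.
mean_accuracy = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    mean_accuracy[metric] = scores.mean()

# Pick the metric with the highest mean accuracy on this data set.
best_metric = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, best_metric)
```

On a small, well-separated data set such as iris the differences tend to be minor; the comparison becomes more useful when features have different scales or the data is high-dimensional.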

Finally, there are a few libraries that I have not mentioned, either because they are out of scope (PyML, Bayesian), because they are not primarily 'libraries' for developers but rather applications for end users (e.g., Orange), or because they have unusual or difficult-to-install dependencies (e.g., mlpy, which requires the gsl, which in turn must be built from source), at least for my OS, which is Mac OS X.

(Note: I am not a developer/committer for scikits.learn.)
