Python decision tree classification of complex objects

Problem description

I have a collection of clothing and accessory products (each represented by a Python object) with various attributes. These products are generated by a combination of querying an external API and scraping the merchant websites to obtain the attributes.

My goal is to develop a classifier that uses these attributes to correctly categorise the products (i.e. into categories such as trousers, t-shirts, dresses etc.).

I have both a training and a test data set. Each is a subset of the entire data set, selected uniformly at random and manually categorised.

I spoke to an ex-university colleague of mine who specialises in machine learning and he suggested using a decision tree. However, the decision tree libraries in Python appear to be very numerically focused (rather than focused on classifying data based on textual attributes).

I am aware of libraries like scikit-learn, but from my brief analysis it appears that they generally involve simpler logic for the rules than I require.

Any suggestions on approach, library, code structure etc. would be greatly appreciated. However, the main focus of this question is which Python machine learning library (if any) would be most appropriate for this task.

The product attributes include the following:

  • name (str)
  • description (str)
  • available_sizes ([str, str...])
  • available_colours ([str, str...])
  • price (float)
  • url (str)
  • category_name (str)
  • images ([str, str...] - urls)

Example product:

{   'category': u"Men's Accessories",
    'colours': [u'White'],
    'description': u'Keep your formal style looking classic with this white short sleeve Oxford shirt with roll up sleeve detailing.',
    'ean': u'',
    'gender': u'M',
    'images': [   u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_large.jpg',
                  u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_2_large.jpg',
                  u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_3_large.jpg',
                  u'http://media.topman.com/wcsstore/TopMan/images/catalog/83D21DWHT_4_large.jpg'],
    'last_scraped': datetime.datetime(2014, 11, 1, 7, 13, 28, 943000),
    'merchant_id': 2479L,
    'merchant_uri': u'http://www.topman.com/en/tmuk/product/white-oxford-short-sleeve-shirt-157702?geoip=noredirect',
    'name': u'White Oxford Short Sleeve Shirt',
    'price': 26.0,
    'sizes': [u'XXS', u'XS', u'S', u'M', u'L', u'XL', u'XXL']}

Recommended answer

You can use scikit-learn, but you need to preprocess your data. Other implementations of decision trees can deal with categorical data directly, but that will not solve your problem. You still need to preprocess the data.

First, I would leave out the images, as using them is somewhat complex. For all the other variables, you need to encode them in a way that is sensible for machine learning. For example, the available sizes could each be encoded as a 0 or 1, depending on whether a given size is available. The colours could be encoded as a categorical variable if they come from a fixed set of strings. If colour is a free-text field, a categorical encoding might not work well: for example, people might use both "gray" and "grey", which would be treated as two completely unrelated values, or the field might contain typos.
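As a rough illustration of the size and colour encoding described above, here is a minimal sketch in plain Python. The size vocabulary, the `COLOUR_ALIASES` map and the helper names are assumptions made for this example, not part of the original data or of any library.

```python
# Hypothetical products; only the 'sizes' and 'colours' keys mirror the example product.
products = [
    {"sizes": ["S", "M", "L"], "colours": ["Grey"]},
    {"sizes": ["M", "XL"], "colours": ["gray "]},
]

# Assumed fixed vocabulary of sizes, known in advance.
SIZE_VOCAB = ["XXS", "XS", "S", "M", "L", "XL", "XXL"]

# Assumed spelling map used to merge colour variants before encoding.
COLOUR_ALIASES = {"gray": "grey"}


def encode_sizes(sizes):
    """Return one 0/1 indicator per size in SIZE_VOCAB."""
    available = set(sizes)
    return [1 if size in available else 0 for size in SIZE_VOCAB]


def normalise_colour(colour):
    """Lower-case, strip whitespace and map known spelling variants."""
    cleaned = colour.strip().lower()
    return COLOUR_ALIASES.get(cleaned, cleaned)


size_features = [encode_sizes(p["sizes"]) for p in products]
colour_features = [[normalise_colour(c) for c in p["colours"]] for p in products]
```

Once the colours are normalised like this, they can be treated as values from a fixed set and encoded as indicators in the same way as the sizes.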

The descriptions and names are probably unique to each product, so using categorical variables there doesn't make sense, as each value will only be seen once. For these it would probably be best to encode them using a bag-of-words approach.
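A bag-of-words encoding of the product names could look roughly like the sketch below, which uses scikit-learn's `CountVectorizer`. The product names other than the first are made up for illustration, and the default tokenisation (lower-casing, words of two or more characters) is only one possible choice.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical product names; only the first comes from the example product.
texts = [
    "White Oxford Short Sleeve Shirt",
    "Blue Slim Fit Jeans",
    "Black Short Sleeve T-Shirt",
]

# Each distinct word becomes one column; each cell holds that word's count in the row's text.
vectorizer = CountVectorizer(lowercase=True)
bag_of_words = vectorizer.fit_transform(texts)  # sparse matrix: one row per text

# Mapping from word to column index learned from the texts.
vocabulary = vectorizer.vocabulary_
```

The resulting sparse matrix can be passed directly to most scikit-learn classifiers, and the same fitted vectorizer should be reused (via `transform`) on the test set.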

You can find a tutorial on text classification in the tutorials section of the scikit-learn documentation. You might want to have a look at the other tutorials, too.

Finally, I would suggest starting with a linear classifier, like Naive Bayes or LinearSVC. Single trees are mostly useful if you want to extract the actual rules, and as far as I know they are rarely used in text processing (there are often tens or hundreds of thousands of features/words, so extracting meaningful rules is hard). If you want to use a tree-based method, an ensemble such as a random forest or gradient boosting will most likely yield better results.
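To make the suggestion concrete, here is a minimal sketch of a linear text classifier built with scikit-learn's `Pipeline`, `TfidfVectorizer` and `LinearSVC`. The training names and category labels are invented for this example, and a real data set would need far more examples per category.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical, manually categorised training data (product name -> category).
train_names = [
    "White Oxford Short Sleeve Shirt",
    "Blue Denim Long Sleeve Shirt",
    "Black Slim Fit Jeans",
    "Grey Skinny Jeans",
    "Red Floral Summer Dress",
    "Black Maxi Dress",
]
train_labels = ["shirts", "shirts", "trousers", "trousers", "dresses", "dresses"]

# Bag-of-words features (TF-IDF weighted) followed by a linear classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
model.fit(train_names, train_labels)

# Classify previously unseen (hypothetical) product names.
predicted = model.predict(["Green Checked Shirt", "Navy Chino Jeans"])
```

Because the classifier is just the last pipeline step, it can be swapped for `MultinomialNB` or `RandomForestClassifier` to compare approaches without changing the preprocessing.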
