Choosing a classification algorithm to classify a mix of nominal and numeric data?


Problem description

I have a dataset of about 100,000 records describing customers' buying patterns. The dataset contains:

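  • Age (a continuous value from 2 to 120), although I also plan to bin it into age ranges.
  • Gender (0 or 1)
  • Address (one of only six types; I could also represent it with the numbers 1 to 6)
  • Preferred shop (one of only 7 shops), which is the class I want to predict.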

So my problem is to classify customers and predict their preferred shop from their age, gender, and location. I have tried naive Bayes and decision trees, but their classification accuracy is a little low.

I am also considering logistic regression, but I am not sure how to handle discrete values such as gender and address. I have also considered an SVM with a kernel trick, but have not tried it yet.

So which machine learning algorithm would you suggest for better accuracy with these features?

Recommended answer

The issue is that you're representing nominal variables on a continuous scale, which imposes a (spurious) ordinal relationship between classes when you use machine learning methods. For example, if you code address as one of six possible integers, then address 1 is closer to address 2 than it is to addresses 3, 4, 5, or 6. This is going to cause problems when you try to learn anything.

Instead, translate your 6-value categorical variable into six binary variables, one for each categorical value. Your original feature will then give rise to six features, of which only one will ever be on for a given record. Also, keep age as an integer value, since you lose information by making it categorical.
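As a rough illustration of the encoding described above, here is a minimal sketch in plain Python. The names `ADDRESS_TYPES`, `one_hot`, `encode_customer`, and `squared_distance` are hypothetical, and the address values 1 to 6 are assumed from the question's description. It also compares distances under the two encodings to show why the integer coding is misleading.

```python
# Assumed category values: the question only says there are six address types.
ADDRESS_TYPES = [1, 2, 3, 4, 5, 6]


def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def encode_customer(age, gender, address):
    """Build a feature vector: integer age, binary gender, one-hot address."""
    return [age, gender] + one_hot(address, ADDRESS_TYPES)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Integer coding: address 1 looks much closer to 2 than to 6.
int_1_vs_2 = squared_distance([1], [2])  # 1
int_1_vs_6 = squared_distance([1], [6])  # 25

# One-hot coding: every pair of distinct addresses is equally far apart.
hot_1_vs_2 = squared_distance(one_hot(1, ADDRESS_TYPES), one_hot(2, ADDRESS_TYPES))  # 2
hot_1_vs_6 = squared_distance(one_hot(1, ADDRESS_TYPES), one_hot(6, ADDRESS_TYPES))  # 2

print(encode_customer(34, 1, 3))  # [34, 1, 0, 0, 1, 0, 0, 0]
```

Gender is already binary, so it needs no further encoding. For distance- or margin-based methods you may also want to rescale age so that it does not dominate the binary features; that is a common practice rather than something stated in the answer.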

As for approaches, the choice is unlikely to make much of a difference (at least initially). Go with whichever is easiest for you to implement. However, make sure you run some sort of cross-validated parameter selection on a dev set before running on your test set, as all algorithms have parameters that can dramatically affect learning accuracy.
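The parameter-selection step might look roughly like the sketch below. Everything in it is an assumption made for illustration: the data is synthetic (two Gaussian clusters, not the customer dataset), a small k-nearest-neighbours classifier stands in for whichever algorithm you pick, and the helper names (`knn_predict`, `cross_val_accuracy`, `select_k`) are hypothetical. The point is only the workflow: choose the parameter by cross-validation on the dev set, then touch the test set once.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def accuracy(train_X, train_y, eval_X, eval_y, k):
    """Fraction of eval points whose predicted label matches the true one."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(eval_X, eval_y))
    return hits / len(eval_y)


def cross_val_accuracy(X, y, k, n_folds=5):
    """Mean accuracy over n_folds folds; fold f holds out every n_folds-th row."""
    scores = []
    for fold in range(n_folds):
        held = set(range(fold, len(X), n_folds))
        tr_X = [x for i, x in enumerate(X) if i not in held]
        tr_y = [t for i, t in enumerate(y) if i not in held]
        ev_X = [X[i] for i in sorted(held)]
        ev_y = [y[i] for i in sorted(held)]
        scores.append(accuracy(tr_X, tr_y, ev_X, ev_y, k))
    return sum(scores) / n_folds


def select_k(dev_X, dev_y, candidates=(1, 3, 5, 7)):
    """Pick the candidate k with the best cross-validated accuracy."""
    return max(candidates, key=lambda k: cross_val_accuracy(dev_X, dev_y, k))


# Synthetic two-class data, for illustration only.
rng = random.Random(0)
data = []
for label, (cx, cy) in enumerate([(0.0, 0.0), (3.0, 3.0)]):
    for _ in range(40):
        data.append(([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)], label))
rng.shuffle(data)
X = [features for features, _ in data]
y = [label for _, label in data]

# Hold the test set out entirely while tuning.
dev_X, dev_y = X[:60], y[:60]
test_X, test_y = X[60:], y[60:]

best_k = select_k(dev_X, dev_y)
test_acc = accuracy(dev_X, dev_y, test_X, test_y, best_k)
print("selected k:", best_k, "test accuracy:", test_acc)
```

Libraries such as scikit-learn provide ready-made equivalents of this loop (for example `GridSearchCV`), which would usually be preferable to hand-rolled code on a dataset of 100,000 records.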
