Mixing categorical and continuous data in a Naive Bayes classifier using scikit-learn


Question

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Among others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications") and continuous data (e.g. "Age", "Length of membership"). I haven't used scikit much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!

Answer

You have at least two options:

  • Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a single multinomial NB on that categorical representation of your data.
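The binning step above can be sketched as follows. This is a minimal example with synthetic data; the feature names and distributions are illustrative, not taken from the question. Each continuous column is cut at its 20th/40th/60th/80th percentiles, the resulting bin indices are one-hot encoded, and a multinomial NB is fit on the result:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# synthetic stand-ins for two continuous features, e.g. age and membership length
X = rng.normal(loc=[40.0, 5.0], scale=[12.0, 3.0], size=(200, 2))
y = rng.integers(0, 2, size=200)

# per-column quintile boundaries -> 5 bins of roughly 20% of the training set each
edges = np.percentile(X, [20, 40, 60, 80], axis=0)
X_binned = np.column_stack(
    [np.digitize(X[:, j], edges[:, j]) for j in range(X.shape[1])]
)

# one-hot encode the bin indices so MultinomialNB treats them as categories
X_onehot = OneHotEncoder().fit_transform(X_binned)
clf = MultinomialNB().fit(X_onehot, y)
```

Note that the percentile edges must be computed on the training set only and reused to bin any test data, so that both splits share the same bin boundaries.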

  • Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and then refit a new model (e.g. a new Gaussian NB) on the new features.
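The second option can be sketched like this, again with synthetic data standing in for the real features (the column layout is an assumption for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
n = 200
X_cont = rng.normal(size=(n, 2))         # e.g. age, membership length
X_cat = rng.integers(0, 2, size=(n, 3))  # e.g. binary flags like "Registered online"
y = rng.integers(0, 2, size=n)

# fit one NB per data type
gauss = GaussianNB().fit(X_cont, y)
multi = MultinomialNB().fit(X_cat, y)  # for strictly binary flags, BernoulliNB also fits

# stack the class assignment probabilities as the new feature matrix
X_new = np.hstack((multi.predict_proba(X_cat), gauss.predict_proba(X_cont)))

# refit a new model on the stacked probabilities
final = GaussianNB().fit(X_new, y)
pred = final.predict(X_new)
```

One caveat: computing predict_proba on the same data the sub-models were trained on, as done here for brevity, can overfit; using held-out or cross-validated probabilities for the second stage is safer.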
