Mixing categorical and continuous data in a Naive Bayes classifier using scikit-learn


Question

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Among others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications") and continuous data (e.g. "Age", "Length of membership"). I haven't used scikit much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!

Answer

You have at least two options:

  • Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a single multinomial NB on that categorical representation of your data.
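The binning step above can be sketched as follows. This is a minimal example with synthetic data; the feature names and distributions are illustrative, not taken from the question. Each continuous column is cut at its 20th/40th/60th/80th percentiles, the resulting bin indices are one-hot encoded, and a multinomial NB is fit on the result:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# synthetic stand-ins for two continuous features, e.g. age and membership length
X = rng.normal(loc=[40.0, 5.0], scale=[12.0, 3.0], size=(200, 2))
y = rng.integers(0, 2, size=200)

# per-column quintile boundaries -> 5 bins of roughly 20% of the training set each
edges = np.percentile(X, [20, 40, 60, 80], axis=0)
X_binned = np.column_stack(
    [np.digitize(X[:, j], edges[:, j]) for j in range(X.shape[1])]
)

# one-hot encode the bin indices so MultinomialNB treats them as categories
X_onehot = OneHotEncoder().fit_transform(X_binned)
clf = MultinomialNB().fit(X_onehot, y)
```

Note that the percentile edges must be computed on the training set only and reused to bin any test data, so that both splits share the same bin boundaries.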

  • Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and then refit a new model (e.g. a new Gaussian NB) on the new features.
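The second option can be sketched like this, again with synthetic data standing in for the real features (the column layout is an assumption for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
n = 200
X_cont = rng.normal(size=(n, 2))         # e.g. age, membership length
X_cat = rng.integers(0, 2, size=(n, 3))  # e.g. binary flags like "Registered online"
y = rng.integers(0, 2, size=n)

# fit one NB per data type
gauss = GaussianNB().fit(X_cont, y)
multi = MultinomialNB().fit(X_cat, y)  # for strictly binary flags, BernoulliNB also fits

# stack the class assignment probabilities as the new feature matrix
X_new = np.hstack((multi.predict_proba(X_cat), gauss.predict_proba(X_cont)))

# refit a new model on the stacked probabilities
final = GaussianNB().fit(X_new, y)
pred = final.predict(X_new)
```

One caveat: computing predict_proba on the same data the sub-models were trained on, as done here for brevity, can overfit; using held-out or cross-validated probabilities for the second stage is safer.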
