Handling continuous data in Spark NaiveBayes


Problem description

As per the official documentation of Spark NaiveBayes:

It supports Multinomial NB (see here), which can handle finitely supported discrete data.

How can I handle continuous data (for example, the percentage of something in a document) in Spark NaiveBayes?

Recommended answer

The current implementation can process only binary features, so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive and might be a better fit when you want to use some domain-specific knowledge.
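The split-based mapping that Bucketizer applies can be illustrated in plain Python (this is not Spark code; the splits below are arbitrary example thresholds):

```python
import bisect

def bucketize(value, splits):
    """Map a continuous value to a bucket index, mirroring Spark's Bucketizer:
    bucket i covers [splits[i], splits[i+1]), and the last bucket also
    includes its upper bound."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value outside of bucket range")
    if value == splits[-1]:
        return len(splits) - 2  # upper bound falls into the last bucket
    return bisect.bisect_right(splits, value) - 1

# Example: percentages split into low / medium / high buckets
splits = [0.0, 0.33, 0.66, 1.0]
print(bucketize(0.25, splits))  # 0
print(bucketize(0.50, splits))  # 1
print(bucketize(1.00, splits))  # 2
```

QuantileDiscretizer differs only in that it learns the splits from the data (approximate quantiles) instead of taking them from you, which is why Bucketizer is the cheaper choice when you already know meaningful thresholds.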

For encoding you can use dummy encoding with OneHotEncoder, with its dropLast Param adjusted.

So overall you'll need:

  • QuantileDiscretizer or Bucketizer -> OneHotEncoder for each continuous feature.
  • StringIndexer* -> OneHotEncoder for each discrete feature.
  • VectorAssembler to combine all of the above.

* Or predefined column metadata.
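For a single row, the whole chain can be traced end to end in plain Python (a sketch of what the Bucketizer -> OneHotEncoder and StringIndexer -> OneHotEncoder stages compute before VectorAssembler concatenates them; the column values, splits, and label mapping are invented for illustration). In an actual Spark job each step would be a fitted pipeline stage.

```python
def bucketize(value, splits):
    # Same mapping as Spark's Bucketizer: bucket i covers [splits[i], splits[i+1])
    for i in range(len(splits) - 1):
        if splits[i] <= value < splits[i + 1]:
            return i
    return len(splits) - 2  # upper bound falls into the last bucket

def one_hot(index, size):
    # OneHotEncoder with dropLast=False: one indicator per category
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

# One toy row: a continuous percentage and a discrete category
percentage = 0.72                    # continuous feature
category = "sports"                  # discrete feature

splits = [0.0, 0.33, 0.66, 1.0]      # hypothetical Bucketizer splits
labels = {"news": 0, "sports": 1}    # what StringIndexer would assign

# Continuous: Bucketizer -> OneHotEncoder
cont_vec = one_hot(bucketize(percentage, splits), len(splits) - 1)
# Discrete: StringIndexer -> OneHotEncoder
disc_vec = one_hot(labels[category], len(labels))

# VectorAssembler: concatenate everything into one feature vector
features = cont_vec + disc_vec
print(features)  # [0.0, 0.0, 1.0, 0.0, 1.0]
```

The resulting vector contains only 0/1 entries, which is what the Multinomial NaiveBayes implementation expects.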

