Handling continuous data in Spark NaiveBayes

Question

According to the official Spark NaiveBayes documentation:

It supports Multinomial NB (see here), which can handle finitely supported discrete data.

How can I handle continuous data (for example, the percentage of some data in some document) in Spark NaiveBayes?

Recommended answer

The current implementation can process only binary features, so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive, because you supply the split points yourself instead of having them estimated from the data, and it might be a better fit when you want to apply domain-specific knowledge.
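
To make the difference between the two discretizers concrete, here is a plain-Python sketch (not Spark code) of what each one does to a single column. The helper names bucketize and quantile_splits and the sample percentages are hypothetical. Spark's QuantileDiscretizer estimates quantiles approximately, whereas this sketch computes them exactly with the standard library.

```python
from bisect import bisect_right
from statistics import quantiles


def bucketize(value, cuts):
    """Return the bucket index of value; each bucket includes its lower cut point."""
    # cuts holds only the interior cut points: values below the first cut fall
    # into bucket 0, values at or above the last cut into the final bucket.
    return bisect_right(cuts, value)


def quantile_splits(values, num_buckets):
    """Derive interior cut points from the data so buckets hold similar counts."""
    return quantiles(values, n=num_buckets, method="inclusive")


# Hypothetical percentages, one per document.
percentages = [0.0, 2.5, 5.0, 7.5, 10.0, 40.0, 60.0, 90.0]

# Bucketizer-style: cut points come from domain knowledge, no pass over the data.
fixed_splits = [10.0, 50.0]
fixed_buckets = [bucketize(p, fixed_splits) for p in percentages]
# -> [0, 0, 0, 0, 1, 1, 2, 2]

# QuantileDiscretizer-style: cut points are computed from the data first.
learned_splits = quantile_splits(percentages, num_buckets=4)
learned_buckets = [bucketize(p, learned_splits) for p in percentages]
# -> [0, 0, 1, 1, 2, 2, 3, 3]
```

The fixed splits leave half of the sample in the lowest bucket, while the quantile-based splits spread it evenly at the cost of an extra pass over the data.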

For encoding, you can use dummy encoding via OneHotEncoder, with the dropLast Param adjusted.
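
As a rough illustration of what the dropLast Param changes, the following plain-Python sketch (not Spark code) dummy-encodes a bucket index both ways. The helper one_hot is hypothetical, and it returns plain lists where Spark's OneHotEncoder would return sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0/1 indicators.

    With drop_last=True the last category gets no slot of its own and is
    encoded as all zeros, so the result has num_categories - 1 entries.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0] * size
    if index < size:
        vector[index] = 1
    return vector


# Three buckets, e.g. low / medium / high.
with_drop = [one_hot(i, 3, drop_last=True) for i in range(3)]
# -> [[1, 0], [0, 1], [0, 0]]

without_drop = [one_hot(i, 3, drop_last=False) for i in range(3)]
# -> [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Spark's default for dropLast is true. One reading of "adjusted" here is setting it to false, so that every bucket gets an explicit indicator feature instead of one bucket being represented by all zeros.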

So overall you'll need:

  • QuantileDiscretizer or Bucketizer -> OneHotEncoder for each continuous feature.
  • StringIndexer* -> OneHotEncoder for each discrete feature.
  • VectorAssembler to combine all of the above.

* Or predefined column metadata.
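
The three steps listed above can be sketched for a single record in plain Python (again not Spark code; comments note which Spark stage each part stands in for). The column names pct and lang, the cut points, and the label order are all hypothetical, and no category is dropped during encoding.

```python
from bisect import bisect_right


def one_hot(index, size):
    """Full 0/1 indicator list for a category index (no dropped category)."""
    return [1 if i == index else 0 for i in range(size)]


def encode_row(row, splits, vocab):
    """Turn one raw record into a flat list of binary features."""
    features = []
    # Continuous features: discretize (Bucketizer / QuantileDiscretizer step),
    # then dummy-encode (OneHotEncoder step).
    for name, cuts in splits.items():
        bucket = bisect_right(cuts, row[name])
        features += one_hot(bucket, len(cuts) + 1)
    # Discrete features: map each label to an index (StringIndexer step),
    # then dummy-encode.
    for name, labels in vocab.items():
        features += one_hot(labels.index(row[name]), len(labels))
    # Concatenating the pieces plays the role of VectorAssembler.
    return features


# Hypothetical schema: one continuous and one categorical column.
splits = {"pct": [10.0, 50.0]}        # interior cut points -> 3 buckets
vocab = {"lang": ["en", "de", "fr"]}  # fixed label order

row = {"pct": 42.0, "lang": "de"}
encoded = encode_row(row, splits, vocab)
# -> [0, 1, 0, 0, 1, 0]
```

Every entry of the resulting vector is 0 or 1, which is the form of input the answer says the classifier needs.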
