Handling continuous data in Spark NaiveBayes

Question

As per the official documentation of Spark NaiveBayes:

It supports Multinomial NB (see here) which can handle finitely supported discrete data.

How can I handle continuous data (for example, the percentage of some term in some document) in Spark NaiveBayes?

Answer
The current implementation can process only binary features, so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive and might be a better fit when you want to apply some domain-specific knowledge.
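To make the discretization step concrete, here is a minimal plain-Python sketch of the mapping Bucketizer computes: each value is assigned the index of the half-open interval [splits[i], splits[i+1]) defined by a sorted list of split points, with the last upper bound treated as inclusive. The split values below are hypothetical domain-specific thresholds, not anything from the question.

```python
import bisect

def bucketize(value, splits):
    """Return the bucket index of `value` under the given sorted split points."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value outside of bucket range")
    # bisect_right gives the insertion point; subtract 1 for the bucket index.
    idx = bisect.bisect_right(splits, value) - 1
    # Clamp so the last split is an inclusive upper bound.
    return min(idx, len(splits) - 2)

splits = [0.0, 25.0, 50.0, 75.0, 100.0]  # hypothetical percentage thresholds
print(bucketize(10.0, splits))   # 0
print(bucketize(50.0, splits))   # 2
print(bucketize(100.0, splits))  # 3
```

This is exactly where domain knowledge pays off: if you know meaningful cut points (e.g. percentage bands), hand them to Bucketizer as splits instead of letting QuantileDiscretizer estimate quantiles from the data.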
For encoding you can use dummy encoding with OneHotEncoder, with an adjusted dropLast Param.
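The dropLast behavior is the subtle part, so here is a plain-Python sketch of the dummy encoding OneHotEncoder produces. With dropLast=True (Spark's default) the last category maps to the all-zeros vector; setting it to False keeps one slot per category, which is what you typically want before NaiveBayes.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 list, optionally dropping the last slot."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0] * size
    if index < size:
        vec[index] = 1
    return vec

print(one_hot(0, 3))                   # [1, 0]
print(one_hot(2, 3))                   # [0, 0]  (last category is implicit)
print(one_hot(2, 3, drop_last=False))  # [0, 0, 1]
```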
So overall you'll need:

- QuantileDiscretizer or Bucketizer -> OneHotEncoder for each continuous feature.
- StringIndexer* -> OneHotEncoder for each discrete feature.
- VectorAssembler to combine all of the above.
* Or predefined column metadata.
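The steps above can be sketched end to end in plain Python for a single row. This is only an illustration of what the Spark stages compute, under assumed inputs: a made-up continuous feature ("pct" with hypothetical split points) and a made-up discrete feature ("category" with a hypothetical vocabulary).

```python
import bisect

splits = [0.0, 25.0, 50.0, 75.0, 100.0]   # Bucketizer-style split points (assumed)
categories = ["news", "blog", "forum"]    # StringIndexer-style vocabulary (assumed)

def encode_row(pct, category):
    # 1. Discretize the continuous feature (what Bucketizer does).
    bucket = min(bisect.bisect_right(splits, pct) - 1, len(splits) - 2)
    # 2. One-hot encode the bucket (OneHotEncoder with dropLast=False).
    pct_vec = [1 if i == bucket else 0 for i in range(len(splits) - 1)]
    # 3. Index the discrete feature, then one-hot encode it.
    idx = categories.index(category)
    cat_vec = [1 if i == idx else 0 for i in range(len(categories))]
    # 4. Concatenate into a single feature vector (what VectorAssembler does).
    return pct_vec + cat_vec

print(encode_row(30.0, "blog"))  # [0, 1, 0, 0, 0, 1, 0]
```

In Spark you would chain the corresponding stages in a Pipeline and feed the assembled features column to NaiveBayes; the sketch just shows that the final vector is all 0/1 entries, which is what the binary-feature restriction requires.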