Handling continuous data in Spark NaiveBayes
Question
According to the official documentation for Spark NaiveBayes:
It supports Multinomial NB (see here), which can handle finitely supported discrete data.
How can I handle continuous data (for example, the percentage of something in a document) in Spark NaiveBayes?
Recommended answer
The current implementation can process only binary features, so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive and might be a better fit when you want to apply domain-specific knowledge.
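To make the difference between the two discretizers concrete, here is a minimal plain-Python sketch (not Spark code). `bucketize` mimics what Bucketizer does with splits you supply yourself, and `quantile_splits` mimics how QuantileDiscretizer derives its splits from the data, which is where its extra cost comes from. The function names, the sample values and the 10% / 50% thresholds are made up for illustration.

```python
from bisect import bisect_right
from statistics import quantiles


def bucketize(value, splits):
    """Index of the half-open bucket [splits[i], splits[i+1]) that holds value.

    Assumes splits is sorted and starts with -inf and ends with +inf,
    so every finite value falls into exactly one bucket.
    """
    return bisect_right(splits, value) - 1


def quantile_splits(values, num_buckets):
    """Derive split points from the data so buckets are roughly equally filled.

    Spark uses approximate quantiles; this sketch uses exact ones.
    """
    cuts = quantiles(values, n=num_buckets)  # num_buckets - 1 interior cuts
    return [float("-inf"), *cuts, float("inf")]


percentages = [0.02, 0.05, 0.10, 0.20, 0.40, 0.45, 0.80, 0.95]

# Bucketizer style: thresholds chosen from domain knowledge (hypothetical).
fixed = [float("-inf"), 0.10, 0.50, float("inf")]
fixed_buckets = [bucketize(p, fixed) for p in percentages]

# QuantileDiscretizer style: thresholds computed from the data itself.
learned = quantile_splits(percentages, num_buckets=4)
learned_buckets = [bucketize(p, learned) for p in percentages]
```

With fixed splits the buckets can be unevenly filled, while the quantile-based splits put about the same number of rows in each bucket.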
For encoding you can use dummy encoding with OneHotEncoder and an adjusted dropLast Param.
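The answer does not say which value dropLast should take; presumably it should be set to false so that every bucket or category gets its own indicator feature. The plain-Python sketch below uses a hypothetical `one_hot` helper (not the Spark class) to show what the flag changes: with the default, the last category is encoded as an all-zero vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0 / 1.0 indicators.

    With drop_last=True the vector has one slot fewer and the last
    category is represented by all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Three buckets, value falls in the last one (index 2).
dropped = one_hot(2, 3)                 # last category has no indicator
kept = one_hot(2, 3, drop_last=False)   # one indicator per category
```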
So overall you'll need:
QuantileDiscretizer or Bucketizer -> OneHotEncoder for each continuous feature.
StringIndexer* -> OneHotEncoder for each discrete feature.
VectorAssembler to combine all of the above.
* Or predefined column metadata.
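The steps listed above could be wired together roughly as follows. This is a PySpark sketch under several assumptions: a PySpark version whose OneHotEncoder accepts inputCol / outputCol, a numeric column named "label", and the `_idx` / `_vec` column suffixes, which are arbitrary. The function names `feature_columns` and `build_pipeline` are hypothetical, and dropLast=False is one reading of the "adjusted" Param.

```python
def feature_columns(continuous_cols, discrete_cols):
    """Names of the one-hot encoded columns that VectorAssembler combines."""
    return [f"{c}_vec" for c in list(continuous_cols) + list(discrete_cols)]


def build_pipeline(continuous_cols, discrete_cols, num_buckets=4):
    """Chain discretize -> encode -> assemble -> NaiveBayes in one Pipeline.

    Needs an active SparkSession. pyspark is imported here so that the
    helper above stays usable without Spark installed.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import (
        OneHotEncoder,
        QuantileDiscretizer,
        StringIndexer,
        VectorAssembler,
    )

    stages = []
    for c in continuous_cols:
        # Continuous feature: replace the value with a quantile bucket index.
        stages.append(QuantileDiscretizer(
            numBuckets=num_buckets, inputCol=c, outputCol=f"{c}_idx"))
    for c in discrete_cols:
        # Discrete feature: map each distinct value to a category index.
        stages.append(StringIndexer(inputCol=c, outputCol=f"{c}_idx"))
    for c in list(continuous_cols) + list(discrete_cols):
        # dropLast=False keeps a separate indicator for every category.
        stages.append(OneHotEncoder(
            inputCol=f"{c}_idx", outputCol=f"{c}_vec", dropLast=False))

    stages.append(VectorAssembler(
        inputCols=feature_columns(continuous_cols, discrete_cols),
        outputCol="features"))
    # Assumes the DataFrame already has a numeric "label" column.
    stages.append(NaiveBayes(featuresCol="features", labelCol="label"))
    return Pipeline(stages=stages)
```

A call such as `build_pipeline(["pct"], ["source"]).fit(train_df)` would then train the model, where `pct`, `source` and `train_df` stand in for your own columns and DataFrame. To use domain-specific thresholds instead, swap QuantileDiscretizer for a Bucketizer with explicit splits.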