LDA model prediction inconsistency


Problem description

I trained an LDA model and loaded it into the environment to transform new data:

from pyspark.ml.clustering import LocalLDAModel

lda = LocalLDAModel.load(path)
df = lda.transform(text)

The model adds a new column called topicDistribution. In my opinion, this distribution should be the same for the same input; otherwise the model is not consistent. In practice, however, it is not.

Why does this happen, and how can I fix it?

Recommended answer

LDA uses randomness during training and, depending on the implementation, when inferring topics for new data. The implementation in Spark is based on EM MAP inference, so I believe it only uses randomness when training the model. This means the results will differ each time the algorithm is trained and run.
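To illustrate why a random starting point makes the output depend on the seed, here is a minimal pure-Python sketch. It is a toy, not Spark's algorithm: the function infer_topic_distribution, the topic_word table, and the example document are all hypothetical names and values made up for this illustration. It runs a few EM-style updates of one document's topic mixture from a random initial guess, with the topic-word probabilities held fixed.

```python
import random


def infer_topic_distribution(word_counts, topic_word, seed=None, iterations=3):
    """Toy EM-style estimate of one document's topic mixture.

    Hypothetical helper, not Spark's implementation. It only shows that a
    random initial guess plus a fixed iteration budget gives seed-dependent
    output.
    """
    rng = random.Random(seed)  # seed=None means a different start on each run
    k = len(topic_word)

    # Random initial mixture, normalised so the weights sum to 1.
    theta = [rng.random() + 0.1 for _ in range(k)]
    total = sum(theta)
    theta = [t / total for t in theta]

    for _ in range(iterations):
        new_theta = [0.0] * k
        for word, count in word_counts.items():
            # Share of this word attributed to each topic under the current mixture.
            weights = [theta[j] * topic_word[j][word] for j in range(k)]
            norm = sum(weights)
            for j in range(k):
                new_theta[j] += count * weights[j] / norm
        total = sum(new_theta)
        theta = [t / total for t in new_theta]
    return theta


# Hypothetical fixed topic-word probabilities for two topics.
topic_word = [
    {"spark": 0.7, "topic": 0.2, "seed": 0.1},
    {"spark": 0.1, "topic": 0.3, "seed": 0.6},
]
doc = {"spark": 2, "topic": 3, "seed": 2}

same_a = infer_topic_distribution(doc, topic_word, seed=1)
same_b = infer_topic_distribution(doc, topic_word, seed=1)
other = infer_topic_distribution(doc, topic_word, seed=2)

print(same_a == same_b)  # identical seed, identical result
print(same_a == other)   # different seed, different result after 3 iterations
```

With the same seed the two runs start from the same point and therefore agree exactly. With a different seed, the runs have not yet converged after the small iteration budget, so the reported distributions differ.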

To get the same results when running on the same input with the same parameters, you can set the random seed when training the model. For example, to set the random seed to 1:

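# LDA.train is the RDD-based API in pyspark.mllib (not pyspark.ml)
from pyspark.mllib.clustering import LDA

model = LDA.train(data, k=2, seed=1)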

To set the seed when transforming new data, create a parameter map that overrides the default value (None for seed).

lda = LocalLDAModel.load(path)
# Map the model's seed Param to a fixed value (the 1L suffix is Python 2 only)
paramMap = {lda.seed: 1}
df = lda.transform(text, paramMap)

For more information about overriding model parameters, see here.
