SageMaker LDA topic model - how to access the params of the trained model? Also, is there a simple way to capture coherence?

Question

I'm new to SageMaker and am running some tests to compare the performance of NTM and LDA on AWS with LDA Mallet and the native Gensim LDA model.

I want to inspect the trained models on SageMaker, for example to see which words contribute most to each topic, and also to get a measure of model coherence.

For NTM on SageMaker, I was able to get the highest-contributing words for each topic by downloading the output file, untarring it, and unzipping the result to expose three files: params, symbol.json and meta.json.

However, when I try to do the same process for LDA, the untarred output file cannot be unzipped.

Maybe I'm missing something, or LDA needs a different procedure than NTM, but I have not been able to find any documentation on this. Also, has anyone found a simple way to calculate model coherence?

Any help would be greatly appreciated!

Recommended answer

This SageMaker notebook, which dives into the scientific details of LDA, also demonstrates how to inspect the model artifacts. Specifically, it shows how to obtain estimates of the Dirichlet prior alpha and the topic-word distribution matrix beta. You can find the instructions in the section titled "Inspecting the Trained Model". For convenience, I will reproduce the relevant code here:

import os
import tarfile

import mxnet as mx

# extract the tarball
# FILENAME_PREFIX is the directory that holds the downloaded tarball
tarfile_fname = os.path.join(FILENAME_PREFIX, 'model.tar.gz')
with tarfile.open(tarfile_fname) as tar:
    tar.extractall(path=FILENAME_PREFIX)  # extract next to the tarball

# obtain the model file (should be the only file starting with "model_")
model_list = [
    fname
    for fname in os.listdir(FILENAME_PREFIX)
    if fname.startswith('model_')
]
model_fname = os.path.join(FILENAME_PREFIX, model_list[0])

# load the contents of the model file into MXNet arrays
alpha, beta = mx.ndarray.load(model_fname)

That should get you the model data. Note that the topics, which are stored as rows of beta, are not presented in any particular order.
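
As a rough sketch of what could be done with beta once it is loaded, the snippet below lists the highest-weight words of each topic and scores each topic with UMass coherence. Several things here are assumptions rather than part of the notebook or of SageMaker: `vocab` (a list mapping column index to word) and `doc_term_counts` (a documents-by-vocabulary count matrix) are hypothetical names for data from your own preprocessing, the helper functions are illustrative, and UMass is only one of several coherence measures. The toy arrays stand in for real data; with a trained model you would convert the MXNet array first with `beta.asnumpy()`.

```python
import numpy as np


def top_word_ids(beta, k=10):
    """Column indices of the k largest entries in each row (topic) of beta."""
    beta = np.asarray(beta)
    # argsort is ascending, so reverse each row before taking the first k
    return np.argsort(beta, axis=1)[:, ::-1][:, :k]


def umass_coherence(word_ids, doc_term_counts):
    """UMass coherence of one topic; word_ids ordered most to least probable."""
    present = np.asarray(doc_term_counts) > 0  # True where a word occurs in a doc
    score = 0.0
    for i in range(1, len(word_ids)):
        for j in range(i):
            # number of documents containing both words
            co_df = np.sum(present[:, word_ids[i]] & present[:, word_ids[j]])
            # document frequency of the higher-ranked word (assumed > 0)
            df_j = np.sum(present[:, word_ids[j]])
            score += np.log((co_df + 1.0) / df_j)
    return float(score)


# Toy stand-ins (hypothetical); with a real model use beta.asnumpy()
beta = np.array([[0.1, 0.6, 0.3],
                 [0.5, 0.2, 0.3]])
vocab = ["apple", "bank", "cloud"]
doc_term_counts = np.array([[1, 1, 0],
                            [1, 0, 1],
                            [1, 1, 0]])

ids = top_word_ids(beta, k=2)
top_words = [[vocab[j] for j in row] for row in ids]
scores = [umass_coherence(row, doc_term_counts) for row in ids]
for words, score in zip(top_words, scores):
    print(words, round(score, 3))
```

Scores closer to zero suggest that a topic's top words tend to co-occur in the same documents. To compare against Mallet or Gensim models, the same value of k and the same reference corpus would need to be used for every model.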
