Using a Java toolkit for topic modeling

Question

I'm working on text classification and I want to use topic models (LDA). My corpus consists of at least 24,000 Persian news documents. Each document in the corpus is stored as a set of (keyword, weight) pairs extracted from the news article.

I have looked at two Java toolkits: MALLET and LingPipe. I've read the MALLET tutorial on importing data, and it expects plain text rather than the format I have. Is there any way to work around that?

I also read a little about LingPipe; the example in its tutorial uses arrays of integers. Is that convenient for large data sets?

Which implementation of LDA would be better for me? Are there any other implementations (in Java) that suit my data?

Recommended answer

From the keyword-weight file you can create an artificial text that contains the words in random order, each repeated in proportion to its weight. Then run MALLET on the generated texts to retrieve the topics.
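
The idea above could be sketched roughly as follows. This is only an illustration, not code from MALLET or LingPipe: the class name `KeywordWeightExpander`, the method `toPseudoText`, and the `scale` parameter are all made up for this example, and it assumes that each keyword is a single whitespace-free token and that weights are non-negative. The English keywords stand in for the Persian ones.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper: turns (keyword, weight) pairs into a pseudo-document.
final class KeywordWeightExpander {

    /**
     * Repeats each keyword roughly (weight * scale) times, shuffles the
     * tokens, and joins them with spaces.
     *
     * @param weights keyword -> non-negative weight (assumed input format)
     * @param scale   assumed factor that converts a weight into a count
     * @param rng     source of randomness for the shuffle
     */
    static String toPseudoText(Map<String, Double> weights, double scale, Random rng) {
        List<String> tokens = new ArrayList<>();
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            // Round to the nearest count, but keep every keyword at least once.
            int count = (int) Math.max(1, Math.round(entry.getValue() * scale));
            for (int i = 0; i < count; i++) {
                tokens.add(entry.getKey());
            }
        }
        Collections.shuffle(tokens, rng);
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Illustrative document; real keywords would come from the corpus.
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("economy", 0.5);
        doc.put("bank", 0.2);
        doc.put("tax", 0.01);

        // One line of plain text per document, ready to be written to a file.
        System.out.println(toPseudoText(doc, 10, new Random()));
    }
}
```

The shuffle mirrors the "random order" in the answer; since LDA treats a document as a bag of words, the order should not affect the topics. The value of `scale` is a judgment call: a larger value preserves the weight ratios more faithfully but produces longer pseudo-documents. The resulting plain-text files can then be imported the way the MALLET tutorial describes for ordinary text.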
