Get highest frequency terms from Lucene index


Question

I need to extract the terms with the highest frequencies from several Lucene indexes, to use them for some semantic analysis.

So I want to get perhaps the top 30 most frequently occurring terms (I have not decided on a threshold yet; I will analyze the results first) and their per-index counts. I am aware that I might lose some precision because of potentially dropped duplicates, but for now, let's say I am OK with that.

As for proposed solutions, speed is (perhaps needless to say) not important, since I will be doing static analysis. I would put the emphasis on simplicity of implementation, because I am not very experienced with Lucene and can't wrap my mind around some of its concepts.

I could not find any code samples for anything similar, so any concrete advice (code, pseudocode, links to code samples...) is appreciated!

Thanks!

Recommended answer

Take a look at this: http://sujitpal.blogspot.com/2009/02/summarization-with-lucene.html

The class on this page has a computeTopTermQuery method, which you should be able to retrofit easily to go over multiple indexes.
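The linked post's code is not reproduced here, so the following is only a rough sketch of the "going over multiple indexes" part, not the method from that page. It assumes the term counts of each index have already been read out into one `Map<String, Integer>` per index (how that is done depends on the Lucene version in use). The class name `TopTermsAcrossIndexes` and the method `topTerms` are hypothetical names chosen for illustration. It sums the counts across indexes, keeps the top `n` terms, and retains the per-index counts the question asks for.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: merges per-index term counts and keeps the most frequent terms. */
final class TopTermsAcrossIndexes {

    /**
     * @param perIndexCounts one map per index, term -> count in that index
     *                       (assumed to be extracted from each Lucene index beforehand)
     * @param n              how many terms to keep, e.g. 30
     * @return the top n terms by summed count, highest total first; each term maps to
     *         its per-index counts (0 where the term does not occur in an index)
     */
    public static Map<String, int[]> topTerms(List<Map<String, Integer>> perIndexCounts, int n) {
        int indexCount = perIndexCounts.size();

        // Collect the per-index counts for every term seen in any index.
        Map<String, int[]> counts = new HashMap<>();
        for (int i = 0; i < indexCount; i++) {
            for (Map.Entry<String, Integer> e : perIndexCounts.get(i).entrySet()) {
                counts.computeIfAbsent(e.getKey(), k -> new int[indexCount])[i] += e.getValue();
            }
        }

        // Sort by total count, descending; break ties alphabetically for a stable order.
        List<Map.Entry<String, int[]>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byTotal = Integer.compare(total(b.getValue()), total(a.getValue()));
            return byTotal != 0 ? byTotal : a.getKey().compareTo(b.getKey());
        });

        // Keep only the first n entries, preserving the sorted order.
        Map<String, int[]> top = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : sorted.subList(0, Math.min(n, sorted.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    private static int total(int[] perIndex) {
        int sum = 0;
        for (int c : perIndex) {
            sum += c;
        }
        return sum;
    }
}
```

Calling `TopTermsAcrossIndexes.topTerms(List.of(countsA, countsB), 30)` would then give the 30 terms with the largest combined counts, each with a two-element array holding its count in the first and second index. Since speed is not a concern here, sorting every term is the simplest option; a bounded priority queue would avoid the full sort for very large vocabularies.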
