Get highest frequency terms from Lucene index
Question
I need to extract the terms with the highest frequencies from several Lucene indexes, to use them for some semantic analysis.
So I want to get maybe the top 30 most frequently occurring terms (I have not decided on a threshold yet; I will analyze the results) and their per-index counts. I am aware that I might lose some precision because of potentially dropped duplicates, but for now, let's say I am OK with that.
For any proposed solution, speed is not important (needless to say, maybe), since I will be doing static analysis. I would put the emphasis on simplicity of implementation, because I am not very skilled with Lucene and can't wrap my mind around some of its concepts.
I cannot find any code samples for anything similar, so I would appreciate any concrete advice (code, pseudocode, links to code samples...).
Thanks!
Recommended answer
Take a look at this: http://sujitpal.blogspot.com/2009/02/summarization-with-lucene.html
The class on that page has a computeTopTermQuery method, which you should easily be able to retrofit to go over multiple indexes.
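For the "multiple indexes" part, the merging step can be sketched independently of Lucene. The following is a minimal sketch, not the code from the linked page: it assumes you have already read each index into a term-to-count map (how you do that depends on your Lucene version; the lucene-misc module ships a HighFreqTerms utility that may cover it). The class name TopTerms and its methods mergeCounts and topN are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: combines per-index term counts and picks the top N.
// Assumption: each map was filled beforehand from one Lucene index.
class TopTerms {

    /** Sums the counts of each term across all per-index maps. */
    static Map<String, Long> mergeCounts(List<Map<String, Long>> perIndexCounts) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> counts : perIndexCounts) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return totals;
    }

    /** Returns up to n entries, highest total first; ties are ordered alphabetically. */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Made-up counts standing in for two indexes.
        Map<String, Long> indexA = new HashMap<>();
        indexA.put("lucene", 5L);
        indexA.put("index", 3L);
        Map<String, Long> indexB = new HashMap<>();
        indexB.put("lucene", 2L);
        indexB.put("term", 4L);

        List<Map<String, Long>> all = List.of(indexA, indexB);
        for (Map.Entry<String, Long> e : topN(mergeCounts(all), 30)) {
            // Per-index counts: a term missing from an index counts as 0.
            System.out.println(e.getKey() + " total=" + e.getValue()
                    + " A=" + indexA.getOrDefault(e.getKey(), 0L)
                    + " B=" + indexB.getOrDefault(e.getKey(), 0L));
        }
    }
}
```

Since speed is not a concern here, sorting all merged terms is the simplest option. Keeping the per-index maps around lets you report each top term's count per index next to its total.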