How to count term frequency for a set of documents?


Question

I have a Lucene index with the following documents:

doc1 := { caldari, jita, shield, planet }
doc2 := { gallente, dodixie, armor, planet }
doc3 := { amarr, laser, armor, planet }
doc4 := { minmatar, rens, space }
doc5 := { jove, space, secret, planet }

These 5 documents use 14 different terms:

[ caldari, jita, shield, planet, gallente, dodixie, armor, amarr, laser, minmatar, rens, jove, space, secret ]

with the following frequency for each term:

[ 1, 1, 1, 4, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1 ]

or, for easier reading:

[ caldari:1, jita:1, shield:1, planet:4, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:1, rens:1, jove:1, space:2, secret:1 ]

What I want to know now is how to obtain the term frequency vector for a set of documents.

For example:

Set<Documents> docs := [ doc2, doc3 ]

termFrequencies = magicFunction(docs); 

System.out.println( termFrequencies );

would produce the output:

[ caldari:0, jita:0, shield:0, planet:2, gallente:1, dodixie:1, 
armor:2, amarr:1, laser:1, minmatar:0, rens:0, jove:0, space:0, secret:0 ]

or, with all zeros removed:

[ planet:2, gallente:1, dodixie:1, armor:2, amarr:1, laser:1 ]

Note that the result vector contains only the term frequencies of the set of documents, NOT the overall frequencies of the whole index. The term 'planet' appears 4 times in the whole index, but the source set of documents contains it only 2 times.

A naive implementation would be to iterate over all documents in the docs set, create a map, and count each term. But I need a solution that also works with a document set size of 100,000 or 500,000.
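To make the naive approach concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: each document is modeled as a list of term strings, and the class name `NaiveTermCount` and the method `countTerms` are made up for illustration. The cost is linear in the total number of terms across the set, which is the part that becomes a concern at 100,000 or more documents.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NaiveTermCount {

    // Counts how often each term occurs across the given documents.
    // Assumption: a document is just a list of its terms; in a real
    // application the terms would have to be read from the index.
    static Map<String, Integer> countTerms(List<List<String>> docs) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                // Add 1 to the running count, starting at 1 for new terms.
                freq.merge(term, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        List<String> doc2 = Arrays.asList("gallente", "dodixie", "armor", "planet");
        List<String> doc3 = Arrays.asList("amarr", "laser", "armor", "planet");
        System.out.println(countTerms(Arrays.asList(doc2, doc3)));
    }
}
```

For doc2 and doc3 this yields planet and armor with a count of 2 and the other four terms with a count of 1. Terms that do not occur in the set never enter the map, so the zero entries are dropped automatically.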

Is there a feature in Lucene I can use to obtain this term vector? If there is no such feature, what would a data structure look like that could be created at index time to obtain such a term vector easily and quickly?

I'm not a Lucene expert, so I'm sorry if the solution is obvious or trivial.

Maybe worth mentioning: the solution should be fast enough for a web application, applied to client search queries.

Recommended answer

Go here: http://lucene.apache.org/java/3_0_1/api/core/index.html and check this method:

org.apache.lucene.index.IndexReader.getTermFreqVectors(int docno);

You will have to know the document id. This is an internal Lucene id, and it usually changes on every index update (that has deletes :-)).

I believe there is a similar method for Lucene 2.x.x.
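The method above returns term vectors for a single document, so the per-document vectors still have to be summed to get the frequencies for a whole set. The sketch below shows only that summing step. It does not call Lucene: `DocVector` is a hypothetical stand-in holding parallel arrays of terms and frequencies, which is the shape that Lucene 3.0's `TermFreqVector` exposes through `getTerms()` and `getTermFrequencies()`. It also assumes term vectors were stored for the field at index time, since otherwise there is nothing to read back.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TermVectorMerge {

    // Hypothetical stand-in for one document's term vector:
    // terms[i] occurs freqs[i] times in that document.
    static final class DocVector {
        final String[] terms;
        final int[] freqs;

        DocVector(String[] terms, int[] freqs) {
            if (terms.length != freqs.length) {
                throw new IllegalArgumentException("terms and freqs must align");
            }
            this.terms = terms;
            this.freqs = freqs;
        }
    }

    // Sums the per-document frequencies into one map for the whole set.
    static Map<String, Integer> merge(List<DocVector> vectors) {
        Map<String, Integer> total = new TreeMap<>();
        for (DocVector v : vectors) {
            for (int i = 0; i < v.terms.length; i++) {
                total.merge(v.terms[i], v.freqs[i], Integer::sum);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        DocVector doc2 = new DocVector(
            new String[] {"armor", "dodixie", "gallente", "planet"},
            new int[] {1, 1, 1, 1});
        DocVector doc3 = new DocVector(
            new String[] {"amarr", "armor", "laser", "planet"},
            new int[] {1, 1, 1, 1});
        System.out.println(merge(Arrays.asList(doc2, doc3)));
    }
}
```

This is still one pass over every document in the set, so it does not by itself address the 100,000 to 500,000 document case from the question; it only avoids re-tokenizing the text.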
