Lucene 4.0 statistics

Question

This is the second time I'm posting this question. The first one is here, but it got no answer, not even a partial one. I've been struggling with this issue and I'm lost in the Lucene API...

What I'm interested in is getting the document length from Lucene. When I use searcher.explain (with BM25), I can see that this value exists, so all I need is a way to fetch it.

I would highly appreciate an example. I'm new to Lucene, so just a pointer to the API won't help.

One naive way to do it is to store the length in a separate field, computed with String.length() in Java, and retrieve it at query time. However, this feature already exists (otherwise BM25 wouldn't work), so I don't want to store something redundantly.

I would highly appreciate a more detailed explanation of how to achieve this with Lucene 4.0. If you're not able to provide an answer, please do not reply just for the sake of replying (others will then skip my post, thinking it is solved), and please don't just send me a pointer to the API (e.g. "See Similarity.computeNorm" by Robert Muir), because that won't help me. I need more details: how do I use FieldInvertState or Similarity.computeNorm? At query time or at index time? A small fragment of code would be helpful. Please keep in mind that I'm not an expert here, otherwise I wouldn't be asking.

Thanks in advance

Recommended answer

Yes, the newer the Lucene version you look at, the more daunting its complexity. Sometimes it helps to read the docs for an earlier version to see the basic principles more clearly.

Now to your case. Similarity is a Strategy-style object that you assign to the whole indexing process (IndexWriterConfig.setSimilarity). Its methods are called to compute various pieces of information about each Document, and each of its Fields, as they are added to the index. So what Robert is suggesting is that you write your own Similarity subclass (take the docs' advice and don't subclass Similarity directly, but rather one of the existing implementations, like DefaultSimilarity). Override the computeNorm method to produce the number you want for the passed-in field. By default Lucene already computes that norm so that it tones down long fields, so I guess you have something more specific than that in mind.
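To make the Strategy idea concrete, here is a minimal sketch in plain Java that does not use Lucene at all. FieldStats, NormStrategy, DefaultNorm and RawLengthNorm are hypothetical stand-ins I made up for FieldInvertState, Similarity and its subclasses; they only illustrate the shape of the override, not the actual Lucene 4.0 signatures, which you should check in the Javadoc for your version.

```java
// Hypothetical stand-in classes for illustration; none of these are Lucene types.
public class LengthNormSketch {

    /** Plays the role of FieldInvertState: per-field facts collected while indexing. */
    public static final class FieldStats {
        public final String fieldName;
        public final int termCount; // number of tokens produced for this field
        public final float boost;   // index-time boost

        public FieldStats(String fieldName, int termCount, float boost) {
            this.fieldName = fieldName;
            this.termCount = termCount;
            this.boost = boost;
        }
    }

    /** Plays the role of the norm-computing part of Similarity. */
    public interface NormStrategy {
        float computeNorm(FieldStats stats);
    }

    /** Classic length normalization: boost / sqrt(termCount), so long fields are toned down. */
    public static final class DefaultNorm implements NormStrategy {
        @Override
        public float computeNorm(FieldStats stats) {
            return stats.boost * (float) (1.0 / Math.sqrt(stats.termCount));
        }
    }

    /** Custom strategy: keep the raw field length instead of a damped weight. */
    public static final class RawLengthNorm implements NormStrategy {
        @Override
        public float computeNorm(FieldStats stats) {
            return stats.termCount;
        }
    }

    public static void main(String[] args) {
        FieldStats body = new FieldStats("body", 100, 1.0f);
        System.out.println(new DefaultNorm().computeNorm(body));   // prints 0.1
        System.out.println(new RawLengthNorm().computeNorm(body)); // prints 100.0
    }
}
```

In real Lucene the strategy object is the one you pass to IndexWriterConfig.setSimilarity, so this computation happens at index time, not at query time. Norms are typically stored in a compact encoded form, so a length read back from them should be treated as approximate.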

I would warmly suggest getting hold of Lucene in Action if you want to get serious about leveraging Lucene.
