Solr自定义相似性-使用索引文档中的字段 [英] Solr Custom Similarity - Using a field from the indexed document

查看:97
本文介绍了Solr自定义相似性-使用索引文档中的字段的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们当前使用的是Lucene V 4.X的非常旧的版本,现在正在迁移到Solr V 7.4.0云.我们有一个自定义的相似性类,可通过文档中使用的索引字段("RANK")来影响得分.

We are currently on a very old version of Lucene V 4.X and are now migrating to Solr V 7.4.0 cloud. We had a custom Similarity Class that we use to influence the score using an indexed field ("RANK") we have in the documents.

这是类的外观-

CustomSimilarity.java

public class CustomSimilarity extends Similarity {
    private final Similarity sim;
    private final double coefficiency;
    private String popularityRank;
    static InfoStream infoStream;

    public CustomSimilarity() {
        this.sim = new CustomPayloadSimilarity();
        this.coefficiency = 0.1;
        this.popularityRank = "RANK";
        infoStream = new LoggingInfoStream();
    }

    @Override
    public long computeNorm(FieldInvertState state) {
        return sim.computeNorm(state);

    }

    @Override
    public SimWeight computeWeight(float queryBoost, CollectionStatistics collectionStats, TermStatistics... termStats) {
        final Explanation idf = termStats.length == 1 ? ((PclnPayloadSimilarity) sim).idfExplain(collectionStats, termStats[0]) : ((PclnPayloadSimilarity) sim)
            .idfExplain(collectionStats, termStats);
        float[] normTable = new float[256];
        for (int i = 1; i < 256; ++i) {
            int length = SmallFloat.byte4ToInt((byte) i);
            float norm = ((PclnPayloadSimilarity) sim).lengthNorm(length);
            normTable[i] = norm;
        }
        normTable[0] = 1f / normTable[255];
        return new IDFStats(collectionStats.field(), queryBoost, idf, normTable);
    }

    public float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        return 1;
    }

    @Override
    public SimScorer simScorer(SimWeight weight, LeafReaderContext context) throws IOException {
        final IDFStats idfstats = (IDFStats) weight;
        final NumericDocValues rank1Value = context.reader().getNumericDocValues(popularityRank);
        infoStream.message("PCLNSimilarity", "NumericDocValues-1 >> rank1Value = " + rank1Value);
        System.out.println("NumericDocValues-1 >> rank1Value = " + rank1Value);

        return new SimScorer() {

            @Override
            public Explanation explain(int doc, Explanation freq) throws IOException {
                return super.explain(doc, freq);
            }

            @Override
            public float score(int doc, float freq) throws IOException {
                // float weightValue = idfstats.queryWeight;
                // // logger.trace("weight " + weightValue + "freq " + freq);
                //
                // float score = 0.0f;
                // if (rank1Value != null) {
                // score = (float) rank1Value.longValue() + score;
                // }
                //
                // if (coefficiency > 0) {
                // score = score + (float) coefficiency * weightValue;
                // }
                // return score;
                return (float) rank1Value.longValue();
            }

            @Override
            public float computeSlopFactor(int distance) {
                return sloppyFreq(distance);
            }

            @Override
            public float computePayloadFactor(int doc, int start, int end, BytesRef payload) {
                return scorePayload(doc, start, end, payload);
            }
        };
    }

    static class IDFStats extends SimWeight {
        private final String field;
        /** The idf and its explanation */
        private final Explanation idf;
        private final float boost;
        private final float queryWeight;
        final float[] normTable;

        public IDFStats(String field, float boost, Explanation idf, float[] normTable) {
            // TODO: Validate?
            this.field = field;
            this.idf = idf;
            this.boost = boost;
            this.queryWeight = boost * idf.getValue();
            this.normTable = normTable;
        }
    }

}

CustomPayloadSimilarity.java

public class CustomPayloadSimilarity extends ClassicSimilarity {

    @Override
    public float tf(float freq) {
        return 1;
    }

    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        if (payload != null) {
            return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
        } else {
            return 1.0F;
        }

    }

    @Override
    public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats) {
        final long df = termStats.docFreq();
        final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
        final float idf = idf(df, docCount);
        return Explanation.match(idf, "idf(docFreq=" + df + ", docCount=" + docCount + ")");
      }

}

您会注意到,由于我们希望保留旧的TFIDF实现与新的TFIDF实现之间的奇偶性,因此我们仍在使用旧算法,而未切换至BM25相似性.

As you can notice, since we want to retain the parity (sort of) between older and newer TFIDF implementation, we are still using older algorithm and haven't switch to BM25Similarity.

使用上面的代码,我无法从文档中检索RANK字段的值.因此,基本上,以下行返回了一些我无法登录到solr.log文件的值- final NumericDocValues rank1Value = context.reader().getNumericDocValues(popularityRank);

With the above code, I am unable to retrieve the value of RANK field from the document. So essentially, the following line is returning some value which I am unable to log to the solr.log file - final NumericDocValues rank1Value = context.reader().getNumericDocValues(popularityRank);

但是return (float) rank1Value.longValue()引发以下异常-

"java.lang.IndexOutOfBoundsException
at java.nio.Buffer.checkIndex(Buffer.java:546)
at java.nio.DirectByteBuffer.getInt(DirectByteBuffer.java:685)
at org.apache.lucene.store.ByteBufferGuard.getInt(ByteBufferGuard.java:128)
at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.readInt(ByteBufferIndexInput.java:415)
at org.apache.lucene.util.packed.DirectReader$DirectPackedReader28.get(DirectReader.java:248)
at org.apache.lucene.codecs.lucene70.Lucene70DocValuesProducer$4.longValue(Lucene70DocValuesProducer.java:490)
at com.priceline.rc.solr.similarity.CustomSimilarity$1.score(CustomSimilarity.java:117)
at org.apache.lucene.search.TermScorer.score(TermScorer.java:65)
at org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect(TopScoreDocCollector.java:64)
at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:263)
at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:214)
at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:662)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:463)
at org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:217)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1622)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1439)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:586)
at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435)
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:375)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2539)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:531)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:760)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:678)
at java.lang.Thread.run(Thread.java:745)\n"

有什么建议吗?

推荐答案

您正在尝试从NumericDocValues获取值,而不用advanceExact()设置当前文档.请记住,每个文档都有一个用于该帐户的NumericDocValues,在请求值之前,您仍然需要告诉它要引用的文档.在得分功能中,尝试在调用rank1Value.longValue()之前添加advanceExact(doc).

You are trying to get a value from NumericDocValues without setting the current document with advanceExact(). Remember that there's a single NumericDocValues for that accounts for every document, you still need to tell it which document you are referring to before requesting a value. In your score function try adding advanceExact(doc) before calling rank1Value.longValue().

应该是这样的:

if(advanceExact(doc))
    return (float) rank1Value.longValue();
else
    return 0; // or whatever value you want as default

这篇关于Solr自定义相似性-使用索引文档中的字段的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆