Accessing words around a positional match in Lucene

Question


Given a term match in a document, what's the best way to access the words around that match? I have read this article, http://searchhub.org//2009/05/26/accessing-words-around-a-positional-match-in-lucene/, but the problem is that the Lucene API has changed completely since that post (2009). Could someone show me how to do this in a newer version of Lucene, such as Lucene 4.6.1?

Edit:


I have figured this out now. (The postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum) have been removed in favor of the new flexible indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum, DocsEnum, DocsAndPositionsEnum). One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (which wraps a byte[]) per term within a single field, rather than a Term. Another is that when asking for a Docs/AndPositionsEnum, you now specify the skipDocs explicitly (typically this will be the deleted docs, but in general you can provide any Bits).)

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TermVectorFun {
  public static String[] DOCS = {
    "The quick red fox jumped over the lazy brown dogs.",
    "Mary had a little lamb whose fleece was white as snow.",
    "Moby Dick is a story of a whale and a man obsessed.",
    "The robber wore a black fleece jacket and a baseball cap.",
    "The English Springer Spaniel is the best of all dogs.",
    "The fleece was green and red",
    "History looks fondly upon the story of the golden fleece, but most people don't agree"
  };

  public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    // Index some made-up content
    IndexWriter writer = new IndexWriter(ramDir, config);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
      doc.add(id);
      // Store both position and offset information in the term vector
      Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();

    // Get a searcher
    DirectoryReader dirReader = DirectoryReader.open(ramDir);
    IndexSearcher searcher = new IndexSearcher(dirReader);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }
    IndexReader reader = searcher.getIndexReader();
    // This assumes a single index segment (true here, since one commit wrote the whole index)
    Spans spans = fleeceQ.getSpans(reader.leaves().get(0), null, new LinkedHashMap<Term, TermContext>());
    int window = 2; // get the words within two positions of the match
    while (spans.next()) {
      int start = spans.start() - window;
      int end = spans.end() + window;
      Map<Integer, String> entries = new TreeMap<Integer, String>();

      System.out.println("Doc: " + spans.doc() + " Start: " + start + " End: " + end);
      Fields fields = reader.getTermVectors(spans.doc());
      Terms terms = fields.terms("content");

      TermsEnum termsEnum = terms.iterator(null);
      BytesRef text;
      while ((text = termsEnum.next()) != null) {
        // could keep the BytesRef here, but String is easier for this example
        String s = text.utf8ToString();
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          // a term vector holds exactly one "document", so freq() is the
          // number of positions at which this term occurs
          for (int j = 0; j < positionsEnum.freq(); j++) {
            int position = positionsEnum.nextPosition();
            if (position >= start && position <= end) {
              entries.put(position, s);
            }
          }
        }
      }
      System.out.println("Entries: " + entries);
    }
  }
}
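The code above only looks at `reader.leaves().get(0)`, which works here because the index has a single segment. For a multi-segment index you would visit every leaf and offset the segment-local `spans.doc()` by the leaf's `docBase`. A sketch of that, as a standalone helper (the class and method names are my own, not from the original post):

```java
import java.io.IOException;
import java.util.LinkedHashMap;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class AllLeavesSpans {
  /**
   * Walks a SpanTermQuery over every index segment, not just leaves().get(0).
   * spans.doc() is segment-local, so docBase is added to get the index-wide
   * document id. Passing getLiveDocs() as the acceptDocs Bits skips deletions.
   * Returns the number of matching spans found.
   */
  public static int printAllSpans(IndexReader reader, SpanTermQuery query) throws IOException {
    int count = 0;
    for (AtomicReaderContext leaf : reader.leaves()) {
      Spans spans = query.getSpans(leaf, leaf.reader().getLiveDocs(),
          new LinkedHashMap<Term, TermContext>());
      while (spans.next()) {
        int globalDoc = leaf.docBase + spans.doc();
        System.out.println("Doc: " + globalDoc
            + " span [" + spans.start() + ", " + spans.end() + ")");
        count++;
      }
    }
    return count;
  }
}
```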

Answer

Use Highlighter. Highlighter.getBestFragment can be used to get a portion of the text containing the best match. Something like:

TopDocs docs = searcher.search(query, maxdocs);
Document firstDoc = searcher.doc(docs.scoreDocs[0].doc);

QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(scorer);
String fragment = highlighter.getBestFragment(myAnalyzer, fieldName, firstDoc.get(fieldName));
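A fuller, runnable sketch of the Highlighter approach (requires the lucene-highlighter module; the class and method names here are my own, not from the answer). A SimpleSpanFragmenter controls how much text around the match is returned, and matches are wrapped in <B>...</B> by the default formatter:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.util.Version;

public class HighlightDemo {
  /**
   * Returns the best-scoring fragment of 'text' for 'query', with matches
   * wrapped in <B>...</B> (the Highlighter default) and the fragment trimmed
   * to roughly fragmentSize characters around the match.
   */
  public static String highlightBestFragment(Query query, String text, int fragmentSize)
      throws IOException, InvalidTokenOffsetsException {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
    QueryScorer scorer = new QueryScorer(query);
    Highlighter highlighter = new Highlighter(scorer);
    // Without a fragmenter the default is a fixed 100-char fragment
    highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, fragmentSize));
    return highlighter.getBestFragment(analyzer, "content", text);
  }

  public static void main(String[] args) throws Exception {
    Query q = new SpanTermQuery(new Term("content", "fleece"));
    System.out.println(highlightBestFragment(q, "The fleece was green and red", 40));
  }
}
```

QueryScorer understands span queries as well as plain term queries, so the same SpanTermQuery used in the question works here directly.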
