How to get a list of all tokens from a Lucene 8.6.1 index?

Question

I have looked at "How to get a list of all tokens from Solr/Lucene index?", but Lucene 8.6.1 doesn't seem to offer IndexReader.terms(). Has it been moved or replaced? Is there an easier way than this answer?

Recommended answer

Some History

You asked: I'm just wondering if IndexReader.terms() has moved or been replaced by an alternative.

The Lucene v3 method IndexReader.terms() was moved to AtomicReader in Lucene v4. This was documented in the v4 alpha release notes.

(Bear in mind that Lucene v4 was released back in 2012.)

The v4 method, AtomicReader.terms(), takes a field name as its argument.

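The v4 release notes state:

One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (wrapping a byte[]) per term within a single field, not a Term.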

The key part there is "per term within a single field". From that point onward, there was no longer a single API call that retrieves all terms from an index.

This approach has carried through to later releases, except that the AtomicReader and AtomicReaderContext classes were renamed to LeafReader and LeafReaderContext in Lucene v5.0.0. See LUCENE-5569.

Latest Releases

That leaves us with the ability to access lists of terms, but only on a per-field basis:

The following Java example was written against Lucene 8.7.0 (the latest release at the time of writing), but should also work with the version you mention (8.6.1):

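import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
...
private void getTokensForField(IndexReader reader, String fieldName) throws IOException {
    // An index is made up of one or more segments ("leaves"); visit each in turn.
    List<LeafReaderContext> list = reader.leaves();

    for (LeafReaderContext lrc : list) {
        // terms() returns null if this segment has no terms for the field.
        Terms terms = lrc.reader().terms(fieldName);
        if (terms != null) {
            TermsEnum termsEnum = terms.iterator();

            // next() returns null once the enumeration is exhausted.
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                System.out.println(term.utf8ToString());
            }
        }
    }
}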

The above example assumes an index reader opened as follows:

private static final String INDEX_PATH = "/path/to/index/directory";
...
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));

If you need to enumerate field names, the code in this question may provide a starting point.
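As a rough sketch of how field enumeration and per-field term enumeration could be combined, the class below reads each segment's FieldInfos, skips fields that are not indexed, and gathers the distinct terms of every remaining field. The class name AllTermsLister, the method names, and the demo() helper with its in-memory index are illustrative assumptions, not part of the original answer. It assumes Lucene 8.x on the classpath and that all terms are UTF-8 text.

```java
import java.io.IOException;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, named here for illustration only.
public class AllTermsLister {

    /** Maps each indexed field name to the sorted set of its distinct terms. */
    public static Map<String, SortedSet<String>> termsByField(IndexReader reader) throws IOException {
        Map<String, SortedSet<String>> result = new TreeMap<>();
        for (LeafReaderContext ctx : reader.leaves()) {
            LeafReader leaf = ctx.reader();
            for (FieldInfo info : leaf.getFieldInfos()) {
                // Fields that are only stored (not indexed) have no terms.
                if (info.getIndexOptions() == IndexOptions.NONE) {
                    continue;
                }
                Terms terms = leaf.terms(info.name);
                if (terms == null) {
                    continue;
                }
                // The same term can occur in several segments; the set removes duplicates.
                SortedSet<String> fieldTerms = result.computeIfAbsent(info.name, k -> new TreeSet<>());
                TermsEnum termsEnum = terms.iterator();
                BytesRef term;
                while ((term = termsEnum.next()) != null) {
                    fieldTerms.add(term.utf8ToString()); // assumes UTF-8 text terms
                }
            }
        }
        return result;
    }

    /** Builds a tiny in-memory index (an assumption for demonstration) and lists its terms. */
    public static Map<String, SortedSet<String>> demo() throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("title", "Lucene Index", Field.Store.NO));
                doc.add(new TextField("body", "list all tokens", Field.Store.NO));
                writer.addDocument(doc);
            }
            try (IndexReader reader = DirectoryReader.open(dir)) {
                return termsByField(reader);
            }
        }
    }
}
```

With the two-field demo index, AllTermsLister.demo() should return {body=[all, list, tokens], title=[index, lucene]}. Collecting every term in memory may be impractical for a large index, in which case processing terms as they are enumerated would be the safer choice.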

Final Note

I guess you can also access terms on a per-document basis, instead of a per-field basis, as mentioned in the comments. I have not tried this.
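One possible way to read terms per document, which may or may not be what the comments had in mind, is through term vectors. The sketch below uses IndexReader.getTermVectors(int), available in Lucene 8.x, and it only returns data for fields that were indexed with term vectors enabled. The class name DocTermVectors, the method names, and the demo() index are assumptions made for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, named here for illustration only.
public class DocTermVectors {

    /** Maps each field that has a term vector to the terms of one document. */
    public static Map<String, List<String>> termsForDocument(IndexReader reader, int docId) throws IOException {
        Map<String, List<String>> result = new TreeMap<>();
        // Returns null when the document has no term vectors at all.
        Fields vectors = reader.getTermVectors(docId);
        if (vectors == null) {
            return result;
        }
        for (String field : vectors) {
            Terms terms = vectors.terms(field);
            if (terms == null) {
                continue;
            }
            List<String> fieldTerms = new ArrayList<>();
            TermsEnum termsEnum = terms.iterator();
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                fieldTerms.add(term.utf8ToString()); // assumes UTF-8 text terms
            }
            result.put(field, fieldTerms);
        }
        return result;
    }

    /** Indexes one document with term vectors enabled (demo data is an assumption). */
    public static Map<String, List<String>> demo() throws IOException {
        // Term vectors are off by default, so enable them on a copy of the text field type.
        FieldType withVectors = new FieldType(TextField.TYPE_NOT_STORED);
        withVectors.setStoreTermVectors(true);
        withVectors.freeze();

        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new Field("body", "list all tokens", withVectors));
                writer.addDocument(doc);
            }
            try (IndexReader reader = DirectoryReader.open(dir)) {
                return termsForDocument(reader, 0);
            }
        }
    }
}
```

For an existing index built without term vectors, this method returns an empty map, so the per-field approach shown earlier remains the more general option.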
