How to get a list of all tokens from a Lucene 8.6.1 index using PyLucene?

Question

I got some direction from this question. I first build the index as shown below.

import os

import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriterConfig, IndexWriter, DirectoryReader
from org.apache.lucene.store import SimpleFSDirectory
from java.nio.file import Paths
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.util import BytesRefIterator

index_path = "./index"

lucene.initVM()

analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
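# Append only when the index directory already exists and holds files;
# otherwise keep the default open mode.
if os.path.isdir(index_path) and os.listdir(index_path):
    config.setOpenMode(IndexWriterConfig.OpenMode.APPEND)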

store = SimpleFSDirectory(Paths.get(index_path))
writer = IndexWriter(store, config)

doc = Document()
doc.add(Field("docid", "1",  TextField.TYPE_STORED))
doc.add(Field("title", "qwe rty", TextField.TYPE_STORED))
doc.add(Field("description", "uio pas", TextField.TYPE_STORED))
writer.addDocument(doc)

writer.close()
store.close()

I then try to get all the terms in the index for one field, as shown below.

store = SimpleFSDirectory(Paths.get(index_path))
reader = DirectoryReader.open(store)

Attempt 1: using next() as in this question, which seems to be a method of BytesRefIterator that TermsEnum implements.

for lrc in reader.leaves():
    terms = lrc.reader().terms('title')
    terms_enum = terms.iterator()
    while terms_enum.next():
        term = terms_enum.term()
        print(term.utf8ToString())

However, I can't seem to access that next() method.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-47-6515079843a0> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while terms_enum.next():
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

AttributeError: 'TermsEnum' object has no attribute 'next'

Attempt 2: changing the while loop as suggested in the comments of this question.

while next(terms_enum):
    term = terms_enum.term()
    print(term.utf8ToString())

However, Python does not seem to recognize TermsEnum as an iterator.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-48-d490ad78fb1c> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while next(terms_enum):
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

TypeError: 'TermsEnum' object is not an iterator
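
The TypeError above is what Python's built-in next() raises for any object that does not define __next__. The Java contract also differs from Python's: BytesRefIterator.next() returns null once the terms are exhausted, whereas a Python iterator signals the end by raising StopIteration. The sketch below reproduces both points in plain Python with a stand-in class. JavaStyleEnum, advance, as_python_iterator and builtin_next_fails are hypothetical names used only for illustration; none of them is part of PyLucene.

```python
class JavaStyleEnum:
    """Stand-in for a Java-style enumerator (NOT a PyLucene class).

    advance() hands back the next item, or None once exhausted,
    mirroring how BytesRefIterator.next() returns null in Java.
    """

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def advance(self):
        if self._pos >= len(self._items):
            return None
        item = self._items[self._pos]
        self._pos += 1
        return item


def as_python_iterator(enum):
    """Adapt the None-terminated contract to Python's iterator protocol."""
    while True:
        item = enum.advance()
        if item is None:
            return  # ends the generator, so callers see StopIteration
        yield item


def builtin_next_fails(obj):
    """Return True if Python's built-in next() rejects obj with TypeError."""
    try:
        next(obj)
    except TypeError:
        return True
    except StopIteration:
        return False
    return False


if __name__ == "__main__":
    print(builtin_next_fails(JavaStyleEnum(["qwe", "rty"])))
    print(list(as_python_iterator(JavaStyleEnum(["qwe", "rty"]))))
```

Turning a null-terminated enumerator into something a for loop can consume is the piece a working solution has to supply.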

I am aware that my question can be answered as suggested in this question. So I guess my real question is: how do I get all the terms in a TermsEnum?

Recommended answer

I found that the code below works, based on the answer here and on test_FieldEnumeration() in the test_Pylucene.py file, which is in pylucene-8.6.1/test3/.

for term in BytesRefIterator.cast_(terms_enum):
    print(term.utf8ToString())
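
Building on the loop above, here is a minimal sketch that gathers the terms of one field from every segment of the index. collect_terms is a hypothetical helper, not a PyLucene function. It takes the cast function as a parameter, so in a PyLucene session you would pass BytesRefIterator.cast_. It assumes that terms() comes back as None for a segment that lacks the field (Lucene returns null in that case), and it converts each BytesRef to a string straight away because Lucene may reuse the BytesRef between iterations.

```python
def collect_terms(reader, field, cast):
    """Return the sorted, de-duplicated terms of `field` across all segments.

    reader: an open index reader exposing leaves(), e.g. a DirectoryReader.
    field:  name of the indexed field to enumerate.
    cast:   callable that makes a TermsEnum iterable from Python;
            assumed to be BytesRefIterator.cast_ when run under PyLucene.
    """
    seen = set()
    for leaf in reader.leaves():
        terms = leaf.reader().terms(field)
        if terms is None:
            # This segment holds no postings for the field; skip it.
            continue
        for term in cast(terms.iterator()):
            # Copy the value out immediately: the BytesRef may be reused.
            seen.add(term.utf8ToString())
    return sorted(seen)
```

With the reader opened in the question, the call would look like collect_terms(reader, "title", BytesRefIterator.cast_). For an index holding only the single document built above, StandardAnalyzer should yield the terms qwe and rty for the title field.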

Happy to accept an answer that explains this in more detail.
