How to get a list of all tokens from a Lucene 8.6.1 index using PyLucene?

Views: 60  Posted: 2021/7/17 19:58:45  Tags: python search lucene full-text-search pylucene

Problem description

I got some direction from this question. I first build the index as shown below.

import os

import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriterConfig, IndexWriter, DirectoryReader
from org.apache.lucene.store import SimpleFSDirectory
from java.nio.file import Paths
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.util import BytesRefIterator

index_path = "./index"

# Start the JVM before using any Lucene class.
lucene.initVM()

analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
# Append only when the directory already exists and holds index files.
if os.path.isdir(index_path) and os.listdir(index_path):
    config.setOpenMode(IndexWriterConfig.OpenMode.APPEND)

store = SimpleFSDirectory(Paths.get(index_path))
writer = IndexWriter(store, config)

# One document with three stored, tokenized text fields.
doc = Document()
doc.add(Field("docid", "1", TextField.TYPE_STORED))
doc.add(Field("title", "qwe rty", TextField.TYPE_STORED))
doc.add(Field("description", "uio pas", TextField.TYPE_STORED))
writer.addDocument(doc)

# Closing the writer commits the document to the index.
writer.close()
store.close()

I then try to get all the terms in the index for one field, as shown below.

store = SimpleFSDirectory(Paths.get(index_path))
reader = DirectoryReader.open(store)

Attempt 1: use next() as in this question; it appears to be a method of BytesRefIterator, which TermsEnum implements.

for lrc in reader.leaves():
    terms = lrc.reader().terms('title')
    terms_enum = terms.iterator()
    while terms_enum.next():
        term = terms_enum.term()
        print(term.utf8ToString())

However, I cannot access that next() method.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-47-6515079843a0> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while terms_enum.next():
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

AttributeError: 'TermsEnum' object has no attribute 'next'
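
One way to see what the PyLucene wrapper actually exposes is to inspect the object before looping over it. The helper below is a hypothetical name, not part of PyLucene; it relies only on Python built-ins and is meant as a small diagnostic sketch for this kind of error.

```python
def iteration_support(obj):
    """Report how `obj` can be iterated from Python (hypothetical helper)."""
    return {
        # Java-style method, called as obj.next()
        "has_next_method": callable(getattr(obj, "next", None)),
        # Python iterator protocol, required by the built-in next()
        "is_iterator": hasattr(obj, "__next__"),
        # Python iterable protocol, the usual requirement of a for loop
        "is_iterable": hasattr(obj, "__iter__"),
    }


# Plain Python objects for comparison; with PyLucene one would pass terms_enum.
print(iteration_support(iter([1, 2])))
print(iteration_support(object()))
```

Calling `iteration_support(terms_enum)` in the session above would show which of the three styles, if any, the wrapper supports.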

Attempt 2: change the while loop as suggested in the comments of this question.

while next(terms_enum):
    term = terms_enum.term()
    print(term.utf8ToString())

However, Python does not seem to treat TermsEnum as an iterator.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-48-d490ad78fb1c> in <module>
      2     terms = lrc.reader().terms('title')
      3     terms_enum = terms.iterator()
----> 4     while next(terms_enum):
      5         term = terms_enum.term()
      6         print(term.utf8ToString())

TypeError: 'TermsEnum' object is not an iterator
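
For context, Python's built-in next() only accepts objects that implement `__next__`, and it signals exhaustion by raising StopIteration rather than by returning a false value, so `while next(x):` is not a drop-in translation of the Java loop even for a real iterator. Below is a minimal pure-Python sketch of the difference. `JavaStyleEnum` is a made-up stand-in class, not a PyLucene class.

```python
class JavaStyleEnum:
    """Stand-in for a Java-style cursor: next() returns None when exhausted."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def next(self):
        if self._pos >= len(self._items):
            return None
        item = self._items[self._pos]
        self._pos += 1
        return item


cursor = JavaStyleEnum(["qwe", "rty"])

# The built-in next() rejects it: the class defines next(), not __next__().
try:
    next(cursor)
except TypeError as exc:
    print("TypeError:", exc)

# iter(callable, sentinel) wraps the Java-style method in a real Python
# iterator that stops as soon as the callable returns the sentinel (None).
terms = list(iter(cursor.next, None))
print(terms)
```

This adapter only helps when the wrapper exposes a callable `next`; the first traceback shows that this TermsEnum wrapper does not, so it illustrates the protocol mismatch rather than solving the problem.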

I am aware that my question can be answered as suggested in this question. So I guess my real question is: how do I get all the terms in a TermsEnum?
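
One possible direction, offered as a sketch rather than a confirmed answer: next() is declared on BytesRefIterator (which the first snippet already imports), so viewing the TermsEnum through that interface with the JCC-generated `cast_` may give Python something it can loop over. The helper `collect_terms` below is a hypothetical name. It takes the adapter as a parameter so the looping logic can be exercised without a JVM. The assumption that `BytesRefIterator.cast_(terms_enum)` is iterable in a given PyLucene build is flagged in the comments.

```python
def collect_terms(reader, field, as_iterator):
    """Return the sorted, de-duplicated terms of `field` across all segments.

    reader      -- an open index reader exposing leaves() (e.g. DirectoryReader)
    field       -- field name, e.g. "title"
    as_iterator -- callable turning a TermsEnum into a Python iterable.
                   ASSUMPTION: with PyLucene, BytesRefIterator.cast_ plays
                   this role; verify against your installed version.
    """
    found = set()
    for leaf in reader.leaves():
        terms = leaf.reader().terms(field)
        if terms is None:  # this segment holds no terms for the field
            continue
        for term in as_iterator(terms.iterator()):
            # Decode immediately: the underlying BytesRef may be reused.
            found.add(term.utf8ToString())
    return sorted(found)
```

With the reader opened earlier, this would be called as `collect_terms(reader, "title", BytesRefIterator.cast_)`. Collecting into a set matters once the index has several segments, because the same term can appear in more than one of them.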
