Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing

Views: 164 · Published: 2018/8/2 13:40:05 · Tags: java, python, apache, indexing, lucene


Question

I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).

When a document is added to the index, Lucene tokenizes it, removes stop words, and so on. If I am not mistaken, this is usually done by the Analyzer.

What I would like to implement is the following: before Lucene stores a given term, I want to perform a lookup (say, in a dictionary) to decide whether to keep the term or discard it. If the term is present in my dictionary, I keep it; otherwise I discard it.

How should I go about this?

Here is my custom implementation of the Analyzer (in Python):

class CustomAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName, reader):

        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)

        ts.reset()

        while ts.incrementToken():
            startOffset = offset.startOffset()
            endOffset = offset.endOffset()
            term = token.toString()
            # accept or reject term

        ts.end()
        ts.close()

        # How to store the terms in the index now?

        return ????

Thanks in advance for your guidance!

EDIT 1: After digging into Lucene's documentation, I figured it had something to do with TokenStreamComponents. It returns a TokenStream with which you can iterate through the tokens of the field you are indexing.

There is also something about the Attributes that I do not understand. More precisely, I can read the tokens, but I have no idea how to proceed afterward.

EDIT 2: I found this post where they mention the use of CharTermAttribute. However, in Python I cannot access or get a CharTermAttribute. Any thoughts?

EDIT 3: I can now access each term; see the updated code snippet. What is left to do is actually storing the desired terms...

Recommended answer

The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.

By defining a filter that extends PythonFilteringTokenFilter, I can override the accept() function (the same hook that StopFilter uses, for instance). Each token for which accept() returns False is dropped before it reaches the index.

Here is the corresponding code snippet:

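from org.apache.lucene.analysis.tokenattributes import CharTermAttribute
from org.apache.pylucene.analysis import PythonFilteringTokenFilter

class MyFilter(PythonFilteringTokenFilter):

  def __init__(self, version, tokenStream):
    super(MyFilter, self).__init__(version, tokenStream)
    # Attribute exposing the text of the current token
    self.termAtt = self.addAttribute(CharTermAttribute.class_)

  def accept(self):
    term = self.termAtt.toString()
    accepted = False
    # Do whatever is needed with the term
    # accepted = ... (True/False)
    return accepted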
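
For intuition about when accept() gets called, here is a rough pure-Python model of the filtering loop. It does not use Lucene at all: TermAttribute, ListTokenStream, FilteringStream, DictionaryStream and collect are hypothetical stand-ins invented for this sketch, and the real FilteringTokenFilter also adjusts position increments for skipped tokens, which is left out here.

```python
class TermAttribute:
    """Stand-in for CharTermAttribute: holds the current token's text."""

    def __init__(self):
        self.term = ""


class ListTokenStream:
    """Hypothetical source stream that emits tokens from a list."""

    def __init__(self, tokens):
        self.term_att = TermAttribute()
        self._tokens = iter(tokens)

    def increment_token(self):
        # Advance to the next token; return False once the input is exhausted.
        tok = next(self._tokens, None)
        if tok is None:
            return False
        self.term_att.term = tok
        return True


class FilteringStream:
    """Model of a filtering token filter: subclasses override accept()."""

    def __init__(self, source):
        self.source = source
        # The attribute object is shared along the whole chain.
        self.term_att = source.term_att

    def accept(self):
        raise NotImplementedError

    def increment_token(self):
        # Skip tokens until accept() approves one or the input runs out.
        while self.source.increment_token():
            if self.accept():
                return True
        return False


class DictionaryStream(FilteringStream):
    """Keeps only the terms found in a dictionary (any container with 'in')."""

    def __init__(self, source, dictionary):
        super().__init__(source)
        self.dictionary = dictionary

    def accept(self):
        return self.term_att.term in self.dictionary


def collect(stream):
    """Drain a stream the way a consumer would, one token at a time."""
    out = []
    while stream.increment_token():
        out.append(stream.term_att.term)
    return out


source = ListTokenStream(["lucene", "foo", "index"])
kept = collect(DictionaryStream(source, {"lucene", "index"}))
print(kept)  # ['lucene', 'index']
```

The point of the model is that the filter never stores terms itself. It only decides, token by token, whether the consumer downstream gets to see the current token, which is why the original approach of draining the stream inside createComponents was a dead end.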

Then just append MyFilter after the other filters (as in the question's code snippet):

filter = MyFilter(Version.LUCENE_4_10_1, filter)
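
The position in the chain matters: placed last, the dictionary lookup sees terms that are already lowercased and stripped of stop words. The following sketch models that ordering with plain Python generators rather than Lucene classes. The function names, the stop list and the dictionary contents are all made up for illustration.

```python
STOP_WORDS = {"the", "a", "is", "of"}       # hypothetical stop list
DICTIONARY = {"lucene", "index", "token"}   # hypothetical lowercase dictionary


def tokenize(text):
    # Crude whitespace tokenizer standing in for StandardTokenizer.
    for word in text.split():
        word = word.strip(".,;:!?")
        if word:
            yield word


def lowercase(tokens):
    for tok in tokens:
        yield tok.lower()


def drop_stop_words(tokens, stop_words):
    for tok in tokens:
        if tok not in stop_words:
            yield tok


def keep_known(tokens, dictionary):
    # Plays the role of MyFilter.accept(): keep only dictionary terms.
    for tok in tokens:
        if tok in dictionary:
            yield tok


def analyze(text):
    # Same order as the chain in the question, with the custom filter last.
    stream = tokenize(text)
    stream = lowercase(stream)
    stream = drop_stop_words(stream, STOP_WORDS)
    stream = keep_known(stream, DICTIONARY)
    return list(stream)


result = analyze("The Lucene index is a pile of Tokens.")
print(result)  # ['lucene', 'index']
```

Here "Lucene" is kept only because lowercasing runs before the lookup, and "tokens" is rejected because the dictionary holds the singular form. If the dictionary stores stems, a stemming step would have to come before the lookup as well.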
