Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing
Question
I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).
What I would like to do is the following: when adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer, if I am not mistaken.
What I would like to implement is the following: before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it; otherwise I discard it).
How should I go about this?
Here is (in Python) my custom implementation of the Analyzer:
class CustomAnalyzer(PythonAnalyzer):
    def createComponents(self, fieldName, reader):
        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)
        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)
        ts.reset()
        while ts.incrementToken():
            startOffset = offset.startOffset()
            endOffset = offset.endOffset()
            term = token.toString()
            # accept or reject term
        ts.end()
        ts.close()
        # How to store the terms in the index now ?
        return ????
Thanks in advance for your guidance!
EDIT 1: After digging into Lucene's documentation, I figured it had something to do with TokenStreamComponents. It returns a TokenStream with which you can iterate through the Token list of the field you are indexing.
Now there is something to do with the Attributes that I do not understand. Or more precisely, I can read the tokens, but have no idea how I should proceed afterward.
EDIT 2: I found this post where they mention the use of CharTermAttribute. However, in Python I cannot access or get a CharTermAttribute. Any thoughts?
EDIT 3: I can now access each term; see the updated code snippet. What is left to be done is actually storing the desired terms...
Recommended answer
The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.
By defining a filter that extends PythonFilteringTokenFilter, I can make use of the accept() function (like the one used in StopFilter, for instance).
Here is the corresponding code snippet:
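from org.apache.pylucene.analysis import PythonFilteringTokenFilter
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

class MyFilter(PythonFilteringTokenFilter):
    def __init__(self, version, tokenStream):
        super(MyFilter, self).__init__(version, tokenStream)
        # Attribute through which the current token's text is read
        self.termAtt = self.addAttribute(CharTermAttribute.class_)

    def accept(self):
        # Called once per token; return True to keep it, False to drop it
        term = self.termAtt.toString()
        accepted = False
        # Do whatever is needed with the term
        # accepted = ... (True/False)
        return accepted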
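To make the role of accept() more concrete, here is a minimal, library-free sketch of the control flow a filtering token filter follows: it keeps pulling tokens from the wrapped stream and only returns the ones for which accept() is true. This is plain Python, not PyLucene; the names ListTokenStream, FilteringStream, DictionaryFilter and collect are hypothetical stand-ins invented for illustration, and Lucene's position-increment bookkeeping is deliberately left out.

```python
class ListTokenStream:
    """Hypothetical stand-in for a token stream: hands out terms one at a time."""

    def __init__(self, terms):
        self._terms = iter(terms)
        self.term = None  # plays the role of the term attribute

    def increment_token(self):
        # Advance to the next term; report False once the input is exhausted.
        try:
            self.term = next(self._terms)
            return True
        except StopIteration:
            return False


class FilteringStream:
    """Mimics a filtering token filter: skips tokens rejected by accept()."""

    def __init__(self, source):
        self.source = source
        self.term = None

    def accept(self):
        raise NotImplementedError("subclasses decide which terms to keep")

    def increment_token(self):
        # Keep consuming the wrapped stream until a token is accepted.
        while self.source.increment_token():
            self.term = self.source.term
            if self.accept():
                return True
        return False


class DictionaryFilter(FilteringStream):
    """Keeps a term only if it is present in the given dictionary (a set)."""

    def __init__(self, source, dictionary):
        super().__init__(source)
        self.dictionary = frozenset(dictionary)

    def accept(self):
        return self.term in self.dictionary


def collect(stream):
    """Drain a stream into a list of the terms it lets through."""
    kept = []
    while stream.increment_token():
        kept.append(stream.term)
    return kept


source = ListTokenStream(["lucene", "foo", "index", "bar"])
kept_terms = collect(DictionaryFilter(source, {"lucene", "index"}))
print(kept_terms)
```

In the real filter, the dictionary lookup would go in the body of accept(), with the term read from the CharTermAttribute as in the snippet above.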
Then just append the filter to the other filters (as in the code snippet of the question):
filter = MyFilter(Version.LUCENE_4_10_1, filter)
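One consequence of appending the filter last is that accept() sees terms after lowercasing and stop-word removal, so the dictionary entries should be lowercase too. The sketch below illustrates that ordering effect with plain Python generators; it does not use Lucene, and lower_case, stop, dictionary_filter, analyze, STOP_WORDS and DICTIONARY are all hypothetical names and values chosen for the example (whitespace splitting stands in for the tokenizer).

```python
def lower_case(tokens):
    """Lowercase every token (stand-in for a lowercasing filter)."""
    for token in tokens:
        yield token.lower()


def stop(tokens, stop_words):
    """Drop tokens that appear in the stop-word set."""
    for token in tokens:
        if token not in stop_words:
            yield token


def dictionary_filter(tokens, dictionary):
    """Keep only tokens present in the dictionary."""
    for token in tokens:
        if token in dictionary:
            yield token


STOP_WORDS = {"the", "a", "is"}    # illustrative only, not Lucene's list
DICTIONARY = {"lucene", "index"}   # lowercase, to match the lowercased terms


def analyze(text):
    """Chain the stages in the same order as the analyzer in the question."""
    tokens = text.split()
    tokens = lower_case(tokens)
    tokens = stop(tokens, STOP_WORDS)
    tokens = dictionary_filter(tokens, DICTIONARY)
    return list(tokens)


with_lowercasing = analyze("The Lucene index is fast")
# Skipping the lowercasing stage makes the capitalized term miss the lookup.
without_lowercasing = list(dictionary_filter("The Lucene index".split(), DICTIONARY))
print(with_lowercasing, without_lowercasing)
```

As for the `return ????` left open in the question: in the Lucene 4.x Java API, createComponents returns a TokenStreamComponents built from the tokenizer and the last filter in the chain, so the analyzer itself never iterates over the stream. The exact way to construct it from PyLucene is an assumption to verify against the PyLucene samples for your version.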