Adding tokens to a Lucene TokenStream

Question

I wrote a TokenFilter which adds tokens in a stream.

I'd be grateful if someone could shed some light on the semantics. In particular, at (*), where the state is restored: doesn't that mean we overwrite either the current token or the token created before the state was captured?

This is roughly what I do:

private final LinkedList<String> extraTokens = new LinkedList<String>();
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private State savedState;

@Override
public boolean incrementToken() throws IOException {
    if (!extraTokens.isEmpty()) {
        // Don't we lose/overwrite the current termAtt token here? (*)
        restoreState(savedState);
        termAtt.setEmpty().append(extraTokens.remove());
        return true;
    }
    if (input.incrementToken()) {
        if (/* condition */) {
           extraTokens.add("fo");
           savedState = captureState();
        }
        return true;
    }
    return false;
}

Does that mean, for an input stream of whitespace tokenized string "a b c"

 (a) -> (b) -> (c) -> ...

where bb is a new synonym for b, that the graph will be constructed like this when restoreState is used?

    (a)
   /   \
(b)    (bb)
   \   /
    (c)
     |
    ...

2. Attributes

Given the text foo bar baz, with fo being the stem of foo and qux being a synonym for bar baz, have I constructed the correct attribute table?

+--------+---------------+-----------+--------------+-----------+
|  Term  |  startOffset  | endOffset | posIncrement | posLength |
+--------+---------------+-----------+--------------+-----------+
|  foo   |       0       |     3     |      1       |     1     |
|  fo    |       0       |     3     |      0       |     1     |
|  qux   |       4       |     11    |      0       |     2     |
|  bar   |       4       |     7     |      1       |     1     |
|  baz   |       8       |     11    |      1       |     1     |
+--------+---------------+-----------+--------------+-----------+

Recommended answer

1.

The attribute-based API works like this: every TokenStream in your analyzer chain modifies the state of some Attributes on every call of incrementToken(). The last element in your chain then produces the final tokens.

Whenever the client of your analyzer chain calls incrementToken(), the last TokenStream sets the state of some Attributes to whatever is necessary to represent the next token. If it is unable to do so on its own, it may call incrementToken() on its input to let the previous TokenStream do its work. This goes on until the last TokenStream returns false, indicating that no more tokens are available.

captureState copies the state of all Attributes of the calling TokenStream into a State; restoreState overwrites every Attribute's state with whatever was captured before (the State passed as its argument).

Your token filter works as follows. It calls input.incrementToken(), so that the previous TokenStream sets the Attributes' state to the next token. Then, if your condition holds (say, termAtt is "b"), it adds "bb" to a queue of pending tokens, saves the current state and returns true, so that the client can consume the token. On the next call of incrementToken(), it does not call input.incrementToken(). Whatever the current state is at that point, it represents the previous, already consumed token. The filter restores the saved state, so that everything is exactly as it was before, then produces "bb" as the current token and returns true, so that the client can consume it. Only on the call after that does it (again) consume the next token from the previous filter.
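The flow described above can be sketched in plain Java. This is only a model and uses no Lucene classes: `Attrs`, `InsertingFilterModel`, the `"b"` condition and the single-space offset arithmetic are all assumptions made for illustration. `copy()` and `restore()` stand in for captureState() and restoreState().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the shared attribute state; not a Lucene class.
final class Attrs {
    String term = "";
    int startOffset;
    int endOffset;

    // Plays the role of captureState(): an independent copy of every value.
    Attrs copy() {
        Attrs c = new Attrs();
        c.term = term;
        c.startOffset = startOffset;
        c.endOffset = endOffset;
        return c;
    }

    // Plays the role of restoreState(): overwrite every value with the saved one.
    void restore(Attrs saved) {
        term = saved.term;
        startOffset = saved.startOffset;
        endOffset = saved.endOffset;
    }
}

final class InsertingFilterModel {
    private final Iterator<String> input;      // stands in for the upstream TokenStream
    private final Attrs attrs = new Attrs();   // the single, mutable "current token"
    private final Deque<String> extraTokens = new ArrayDeque<>();
    private Attrs savedState;
    private int cursor = 0;                    // character offset into the imagined text

    InsertingFilterModel(List<String> tokens) {
        this.input = tokens.iterator();
    }

    boolean incrementToken() {
        if (!extraTokens.isEmpty()) {
            // The client has already consumed the previous token, so overwriting is safe.
            attrs.restore(savedState);
            attrs.term = extraTokens.remove();
            return true;
        }
        if (!input.hasNext()) {
            return false;
        }
        // Models input.incrementToken(): upstream fills in the next token.
        attrs.term = input.next();
        attrs.startOffset = cursor;
        attrs.endOffset = cursor + attrs.term.length();
        cursor = attrs.endOffset + 1;          // assumes one space between tokens
        if (attrs.term.equals("b")) {          // hypothetical condition
            extraTokens.add("bb");
            savedState = attrs.copy();
        }
        return true;
    }

    // Drains the stream and renders each token as term[start,end].
    static List<String> run(List<String> tokens) {
        InsertingFilterModel f = new InsertingFilterModel(tokens);
        List<String> out = new ArrayList<>();
        while (f.incrementToken()) {
            out.add(f.attrs.term + "[" + f.attrs.startOffset + "," + f.attrs.endOffset + "]");
        }
        return out;
    }
}
```

For the tokens a, b, c the model emits a, b, bb, c in that order, and "bb" carries the offsets of "b" because they were restored from the saved copy.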

This won't actually produce the graph you displayed; it inserts "bb" after "b", so the stream really is

(a) -> (b) -> (bb) -> (c)

So why save the state in the first place? When producing tokens, you want to make sure that, for example, phrase queries and highlighting work correctly. When you have the text "a b c" and "bb" is a synonym for "b", you'd expect the phrase query "b c" to work as well as "bb c". You have to tell the index that both "b" and "bb" are at the same position. Lucene uses a position increment for that. By default the position increment is 1, meaning that every new token (read: every call of incrementToken()) comes 1 position after the previous one. So, with the final positions, the produced stream is

(a:1) -> (b:2) -> (bb:3) -> (c:4)

while you actually want

(a:1) -> (b:2)  -> (c:3)
      \            /
       -> (bb:2) ->
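The two streams above differ only in the increment of "bb". As a minimal sketch, the position of each token is the running sum of the position increments. The class name `Positions` is hypothetical, and counting starts so that the first token lands on 1, to match the numbering used in the diagrams above.

```java
import java.util.ArrayList;
import java.util.List;

final class Positions {
    // Absolute position of each token = running sum of its position increments.
    static List<Integer> fromIncrements(int... increments) {
        List<Integer> positions = new ArrayList<>();
        int pos = 0;
        for (int inc : increments) {
            pos += inc;
            positions.add(pos);
        }
        return positions;
    }
}
```

With increments 1, 1, 1, 1 for a, b, bb, c the positions are 1, 2, 3, 4. With increments 1, 1, 0, 1 they are 1, 2, 2, 3, so "b" and "bb" share a position.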

So, for your filter to produce the graph, you have to set the position increment to 0 for the inserted "bb":

private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
// later in incrementToken
restoreState(savedState);
posIncAtt.setPositionIncrement(0);
termAtt.setEmpty().append(extraTokens.remove());

restoreState makes sure that other attributes, such as offsets and token types, are preserved, so you only have to change the ones your use case requires. Yes, you are overwriting whatever state was there before restoreState, so it is your responsibility to use it in the right place. And as long as you don't call input.incrementToken(), you don't advance the input stream, so you can do whatever you want with the state.

2.

A stemmer only changes the token; it typically neither produces new tokens nor changes the position increment or offsets. Also, since the position increment means that the current term comes positionIncrement positions after the previous token, qux should have an increment of 1, because it is the next token after fo, and bar should have an increment of 0, because it is at the same position as qux. The table should rather look like this:

+--------+---------------+-----------+--------------+-----------+
|  Term  |  startOffset  | endOffset | posIncrement | posLength |
+--------+---------------+-----------+--------------+-----------+
|  fo    |       0       |     3     |      1       |     1     |
|  qux   |       4       |     11    |      1       |     2     |
|  bar   |       4       |     7     |      0       |     1     |
|  baz   |       8       |     11    |      1       |     1     |
+--------+---------------+-----------+--------------+-----------+

As a basic rule, for multi-term synonyms, where "ABC" is a synonym for "a b c", you should see that

  • positionIncrement("ABC") > 0 (the increment of the first token)
  • positionIncrement(*) >= 0 (positions must not go backwards)
  • startOffset("ABC") == startOffset("a") and endOffset("ABC") == endOffset("c")
    • actually, tokens at the same (start|end) position must have the same (start|end) offset
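
The rules above can be checked mechanically. The following is a rough plain-Java sketch, not Lucene code: `Tok` and `SynonymRules` are hypothetical names, and reading "first token" as the first token of the stream is an assumption. A token's end position is taken to be its position plus its position length.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical value type mirroring one row of the attribute table.
final class Tok {
    final String term;
    final int start, end, posInc, posLen;

    Tok(String term, int start, int end, int posInc, int posLen) {
        this.term = term;
        this.start = start;
        this.end = end;
        this.posInc = posInc;
        this.posLen = posLen;
    }
}

final class SynonymRules {
    // Returns a description of every rule violation found, in stream order.
    static List<String> violations(List<Tok> tokens) {
        List<String> problems = new ArrayList<>();
        Map<Integer, Integer> startAt = new HashMap<>(); // start position -> start offset
        Map<Integer, Integer> endAt = new HashMap<>();   // end position -> end offset
        int pos = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (i == 0 && t.posInc <= 0) {
                problems.add(t.term + ": first increment must be > 0");
            }
            if (t.posInc < 0) {
                problems.add(t.term + ": positions must not go backwards");
            }
            pos += t.posInc;
            // Tokens starting at the same position must share a start offset.
            Integer seenStart = startAt.putIfAbsent(pos, t.start);
            if (seenStart != null && seenStart != t.start) {
                problems.add(t.term + ": start offset differs at position " + pos);
            }
            // Tokens ending at the same position must share an end offset.
            int endPos = pos + t.posLen;
            Integer seenEnd = endAt.putIfAbsent(endPos, t.end);
            if (seenEnd != null && seenEnd != t.end) {
                problems.add(t.term + ": end offset differs at end position " + endPos);
            }
        }
        return problems;
    }
}
```

Run against the corrected table (fo, qux, bar, baz), the check reports nothing. Run against the table from the question, it flags qux, which starts at offset 4 at the same position as foo, and bar, whose end offset no longer lines up with qux.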

Hope this helps to shed some light.
