Adding tokens to a Lucene TokenStream

Question

I wrote a TokenFilter that adds tokens to a stream.

If someone could shed some light on the semantics, I'd be grateful. In particular, at (*), where the state is restored: doesn't that mean we overwrite either the current token or the token created before the state was captured?

This is roughly what I did:

private final LinkedList<String> extraTokens = new LinkedList<String>();
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private State savedState;

@Override
public boolean incrementToken() throws IOException {
    if (!extraTokens.isEmpty()) {
        // Do we not lose/overwrite the current termAtt token here? (*)
        restoreState(savedState);
        termAtt.setEmpty().append(extraTokens.remove());
        return true;
    }
    if (input.incrementToken()) {
        if (/* condition */) {
           extraTokens.add("fo");
           savedState = captureState();
        }
        return true;
    }
    return false;
}

Does that mean, for an input stream of the whitespace-tokenized string "a b c"

 (a) -> (b) -> (c) -> ...

where bb is a new synonym for b, that the graph will be constructed like this when restoreState is used?

    (a)
   /   \
(b)    (bb)
   \   /
    (c)
     |
    ...

2. Attributes

Given the text foo bar baz, with fo being the stem of foo and qux being a synonym for bar baz, have I constructed the correct attribute table?

+--------+---------------+-----------+--------------+-----------+
|  Term  |  startOffset  | endOffset | posIncrement | posLength |
+--------+---------------+-----------+--------------+-----------+
|  foo   |       0       |     3     |      1       |     1     |
|  fo    |       0       |     3     |      0       |     1     |
|  qux   |       4       |     11    |      0       |     2     |
|  bar   |       4       |     7     |      1       |     1     |
|  baz   |       8       |     11    |      1       |     1     |
+--------+---------------+-----------+--------------+-----------+

Recommended answer

1.

The Attribute-based API works like this: every TokenStream in your analyzer chain modifies the state of some Attributes on every call of incrementToken(). The last element in your chain then produces the final tokens.

Whenever the client of your analyzer chain calls incrementToken(), the last TokenStream sets the state of some Attributes to whatever is necessary to represent the next token. If it is unable to do so on its own, it may call incrementToken() on its input to let the previous TokenStream do its work. This goes on until the last TokenStream returns false, indicating that no more tokens are available.

captureState copies the state of all Attributes of the calling TokenStream into a State; restoreState overwrites every Attribute's state with whatever was captured before (the State passed as its argument).
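The copy-and-overwrite semantics can be illustrated with a toy model in plain Java. This is only a sketch: AttributeStateModel, capture() and restore() are hypothetical stand-ins for the attribute state, captureState() and restoreState(), not Lucene classes, and only three values are modelled.

```java
/** Toy model of attribute state; these are not Lucene classes. */
class AttributeStateModel {
    String term = "";
    int startOffset;
    int endOffset;

    /** Like captureState(): returns an independent copy of the current values. */
    AttributeStateModel capture() {
        AttributeStateModel copy = new AttributeStateModel();
        copy.term = term;
        copy.startOffset = startOffset;
        copy.endOffset = endOffset;
        return copy;
    }

    /** Like restoreState(state): overwrites every current value with the captured one. */
    void restore(AttributeStateModel saved) {
        term = saved.term;
        startOffset = saved.startOffset;
        endOffset = saved.endOffset;
    }

    public static void main(String[] args) {
        AttributeStateModel atts = new AttributeStateModel();
        atts.term = "b";
        atts.startOffset = 2;
        atts.endOffset = 3;
        AttributeStateModel saved = atts.capture();

        // Later changes to the live attributes do not touch the snapshot.
        atts.term = "c";
        atts.startOffset = 4;
        atts.endOffset = 5;

        // Restoring discards whatever was current ("c") and brings back "b".
        atts.restore(saved);
        System.out.println(atts.term + " [" + atts.startOffset + "," + atts.endOffset + ")");
        // prints: b [2,3)
    }
}
```

In this model, whatever the attributes held just before restore() is lost, which is the overwrite the question asks about at (*).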

Your token filter works like this: it calls input.incrementToken(), so that the previous TokenStream sets the Attributes' state to the next token. Then, if your condition holds (say, termAtt is "b"), it adds "bb" to the list of extra tokens, captures the current state, and returns true so that the client can consume the token. On the next call of incrementToken(), it does not call input.incrementToken(). Whatever the current state is at that point, it represents the previous, already consumed token. The filter restores the saved state, so that everything is exactly as it was before, then sets "bb" as the current term and returns true so that the client can consume it. Only on the call after that does it (again) consume the next token from the previous filter.

This won't actually produce the graph you displayed; it inserts "bb" after "b", so the stream is really

(a) -> (b) -> (bb) -> (c)

So why save the state in the first place? When producing tokens, you want to make sure that, for example, phrase queries and highlighting work correctly. When you have the text "a b c" and "bb" is a synonym for "b", you'd expect the phrase query "b c" to match, as well as "bb c". You have to tell the index that "b" and "bb" are in the same position. Lucene uses a position increment for that. By default the position increment is 1, meaning that every new token (read: every call of incrementToken()) comes one position after the previous one. So, with the final positions, the produced stream is

(a:1) -> (b:2) -> (bb:3) -> (c:4)

when what you really want is

(a:1) -> (b:2)  -> (c:3)
      \          /
       -> (bb:2)

So, for your filter to produce the graph, you have to set the position increment to 0 for the inserted "bb":

private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
// later in incrementToken
restoreState(savedState);
posIncAtt.setPositionIncrement(0);
termAtt.setEmpty().append(extraTokens.remove());

restoreState makes sure that the other attributes, such as offsets and token types, are preserved, so you only have to change the ones your use case requires. Yes, you are overwriting whatever state was there before restoreState, so it is your responsibility to use it in the right place. And as long as you don't call input.incrementToken(), you don't advance the input stream, so you can do whatever you want with the state.
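To see how restoring the state and then setting the position increment to 0 plays out over a whole stream, here is a plain-Java model of the filter. It is a sketch under stated assumptions: InjectingFilterModel and its fields are hypothetical stand-ins for the term, offset and position-increment attributes, the input is split on single spaces, and the injection condition is hard-coded to the term "b". It does not use the Lucene API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Plain-Java model of the filter; none of these types are Lucene classes. */
class InjectingFilterModel {
    // Stand-ins for the term, offset and position-increment attributes.
    String term;
    int startOffset, endOffset, posInc;

    private final String[] words;   // hypothetical pre-tokenized input
    private int next = 0;           // index of the next upstream token
    private int cursor = 0;         // character offset into the original text
    private final Deque<String> extraTokens = new ArrayDeque<>();

    // Stand-in for the captured State.
    private String savedTerm;
    private int savedStart, savedEnd, savedPosInc;

    InjectingFilterModel(String... words) {
        this.words = words;
    }

    boolean incrementToken() {
        if (!extraTokens.isEmpty()) {
            // "restoreState": every attribute goes back to the captured token.
            term = savedTerm;
            startOffset = savedStart;
            endOffset = savedEnd;
            posInc = savedPosInc;
            // Change only what differs for the injected token.
            posInc = 0;
            term = extraTokens.remove();
            return true;
        }
        if (next < words.length) {
            // Stands in for input.incrementToken().
            term = words[next++];
            startOffset = cursor;
            endOffset = cursor + term.length();
            cursor = endOffset + 1; // assumes a single space between words
            posInc = 1;
            if (term.equals("b")) {
                extraTokens.add("bb");
                // "captureState"
                savedTerm = term;
                savedStart = startOffset;
                savedEnd = endOffset;
                savedPosInc = posInc;
            }
            return true;
        }
        return false;
    }

    /** Consumes the stream and accumulates absolute positions from the increments. */
    static List<String> run(String... words) {
        InjectingFilterModel filter = new InjectingFilterModel(words);
        List<String> out = new ArrayList<>();
        int position = 0;
        while (filter.incrementToken()) {
            position += filter.posInc;
            out.add(filter.term + ":" + position + " " + filter.startOffset + "-" + filter.endOffset);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run("a", "b", "c"));
        // prints: [a:1 0-1, b:2 2-3, bb:2 2-3, c:3 4-5]
    }
}
```

In the model, "bb" ends up at the same position as "b" and inherits its offsets from the restored state; only the term and the position increment were changed by hand.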

2.

A stemmer only changes the token; it typically neither produces new tokens nor changes the position increment or offsets. Also, since the position increment means that the current term comes positionIncrement positions after the previous token, qux should have an increment of 1, because it is the next token after fo, and bar should have an increment of 0, because it is in the same position as qux. The table should rather look like this:

+--------+---------------+-----------+--------------+-----------+
|  Term  |  startOffset  | endOffset | posIncrement | posLength |
+--------+---------------+-----------+--------------+-----------+
|  fo    |       0       |     3     |      1       |     1     |
|  qux   |       4       |     11    |      1       |     2     |
|  bar   |       4       |     7     |      0       |     1     |
|  baz   |       8       |     11    |      1       |     1     |
+--------+---------------+-----------+--------------+-----------+

As a basic rule, for multi-term synonyms, where "ABC" is a synonym for "a b c", you should see that

  • positionIncrement("ABC") > 0 (the increment of the first token)
  • positionIncrement(*) >= 0 (positions must not go backwards)
  • startOffset("ABC") == startOffset("a") and endOffset("ABC") == endOffset("c")
    • in fact, tokens at the same (start|end) position must have the same (start|end) offset
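These rules can be checked mechanically against a token table. The following plain-Java sketch does that; TokenTableCheck, Row and isConsistent are hypothetical names, not a Lucene utility. It assumes that a token's end position is its position plus its position length, and it simplifies the first rule to "the first token in the stream must have an increment greater than 0".

```java
import java.util.HashMap;
import java.util.Map;

/** Checks the rules above on a token table; a sketch, not a Lucene utility. */
class TokenTableCheck {
    /** One table row: term, startOffset, endOffset, posIncrement, posLength. */
    static final class Row {
        final String term;
        final int start, end, posInc, posLen;

        Row(String term, int start, int end, int posInc, int posLen) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    static boolean isConsistent(Row... rows) {
        Map<Integer, Integer> startOffsetAt = new HashMap<>(); // start position -> startOffset
        Map<Integer, Integer> endOffsetAt = new HashMap<>();   // end position -> endOffset
        int position = 0;
        for (int i = 0; i < rows.length; i++) {
            Row r = rows[i];
            // Positions must not go backwards, and the first token needs an increment > 0.
            if (r.posInc < 0 || (i == 0 && r.posInc == 0)) {
                return false;
            }
            position += r.posInc;
            // Tokens starting at the same position must share a start offset.
            Integer seenStart = startOffsetAt.putIfAbsent(position, r.start);
            if (seenStart != null && seenStart != r.start) {
                return false;
            }
            // Tokens ending at the same position must share an end offset.
            Integer seenEnd = endOffsetAt.putIfAbsent(position + r.posLen, r.end);
            if (seenEnd != null && seenEnd != r.end) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The table from the answer.
        System.out.println(isConsistent(
                new Row("fo", 0, 3, 1, 1),
                new Row("qux", 4, 11, 1, 2),
                new Row("bar", 4, 7, 0, 1),
                new Row("baz", 8, 11, 1, 1))); // prints: true

        // The table from the question: qux shares a position with foo but not its start offset.
        System.out.println(isConsistent(
                new Row("foo", 0, 3, 1, 1),
                new Row("fo", 0, 3, 0, 1),
                new Row("qux", 4, 11, 0, 2),
                new Row("bar", 4, 7, 1, 1),
                new Row("baz", 8, 11, 1, 1))); // prints: false
    }
}
```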

    Hope this helps to clarify things.
