Custom tokenizer for PyLucene that tokenizes text based only on underscores (retains spaces)

Problem description

I am new to PyLucene and I am trying to build a custom analyzer that tokenizes text on underscores only, i.e. it should retain whitespace. Example: "Hi_this is_awesome" should be tokenized into ["hi", "this is", "awesome"].

From various code examples I understood that I need to override the incrementToken method in a CustomTokenizer and write a CustomAnalyzer whose TokenStream uses the CustomTokenizer followed by a LowerCaseFilter to achieve this.

I am having trouble implementing the incrementToken method and connecting the dots (how the tokenizer should be used, since Analyzers usually depend on TokenFilters, which in turn depend on TokenStreams), as there is very little documentation available on PyLucene.

Recommended answer

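Got it working eventually by creating a new tokenizer that treats every character other than an underscore as part of the current token (so the underscore effectively becomes the separator). Because the tokenizer subclasses PythonCharTokenizer, only isTokenChar has to be overridden; incrementToken is inherited from the character tokenizer base class.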


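# Assumes an older PyLucene release with the flat `lucene` namespace;
# newer releases expose these classes under Java-style package paths.
from lucene import LowerCaseFilter, PythonAnalyzer, PythonCharTokenizer

class UnderscoreSeparatorTokenizer(PythonCharTokenizer):
  def __init__(self, input):
    PythonCharTokenizer.__init__(self, input)

  def isTokenChar(self, c):
    # Every character except "_" is part of a token, so spaces are kept.
    return c != "_"

class UnderscoreSeparatorAnalyzer(PythonAnalyzer):
  def __init__(self, version):
    PythonAnalyzer.__init__(self, version)

  def tokenStream(self, fieldName, reader):
    # Split on underscores first, then lowercase each resulting token.
    tokenizer = UnderscoreSeparatorTokenizer(reader)
    return LowerCaseFilter(tokenizer)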
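
To sanity-check the expected output without a PyLucene install, the same logic can be approximated in plain Python. This is only a rough sketch: `is_token_char` and `underscore_tokens` are hypothetical names introduced here, and the code imitates what a character tokenizer followed by a lowercase filter would produce. It does not call Lucene, and it does not model Lucene's internal buffering or its maximum token length.

```python
def is_token_char(c):
    """Mirror of isTokenChar above: every character except '_' belongs to a token."""
    return c != "_"


def underscore_tokens(text):
    """Yield lowercased tokens from text, splitting only on underscores."""
    current = []
    for c in text:
        if is_token_char(c):
            current.append(c)
        elif current:
            # Hit a separator: emit the token collected so far.
            yield "".join(current).lower()
            current = []
    if current:
        # Emit the trailing token, if any.
        yield "".join(current).lower()


if __name__ == "__main__":
    print(list(underscore_tokens("Hi_this is_awesome")))
    # prints ['hi', 'this is', 'awesome']
```

As in the tokenizer above, consecutive, leading, or trailing underscores produce no empty tokens, and spaces inside a token are left untouched.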
