nltk regular expression tokenizer


Question

I tried to implement a regular expression tokenizer with nltk in Python, but this is the result I get:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]

But the result I want is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where is the mistake?

Recommended answer

You should turn all capturing groups into non-capturing groups:

  • ([A-Z]\.)+ -> (?:[A-Z]\.)+
  • \w+(-\w+)* -> \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The issue is that regexp_tokenize seems to use re.findall, which returns a list of tuples of captured groups (rather than a list of whole matches) when the pattern defines multiple capturing groups. See the nltk.tokenize package reference:

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
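
The re.findall behaviour mentioned above can be illustrated with the standard re module alone. This is a minimal sketch, not nltk's actual code path: the sample string and variable names are made up for the illustration, and the exact tuples nltk prints may differ between versions.

```python
import re

sample = "U.S.A. poster-print"  # hypothetical sample text for this sketch

# Two capturing groups: findall returns one tuple per match, holding the
# last text captured by each group ('' when a group did not participate).
with_groups = re.findall(r"([A-Z]\.)+|\w+(-\w+)*", sample)

# Same pattern with non-capturing groups: findall returns whole matches.
without_groups = re.findall(r"(?:[A-Z]\.)+|\w+(?:-\w+)*", sample)

print(with_groups)     # [('A.', ''), ('', '-print')]
print(without_groups)  # ['U.S.A.', 'poster-print']
```

The tuples contain only what each group captured last, which is why the tokens themselves disappear from the output.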

Also, I am not sure you meant to use :-_ in the character class. An unescaped hyphen between two characters defines a range, so :-_ matches every character from : to _, which includes all uppercase letters. To match a literal hyphen, put the - at the end of the character class.
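
A quick check of the range behaviour, again using only the re module. The two class names are hypothetical and exist only for this sketch.

```python
import re

# ':-_' is a range from ':' (0x3A) to '_' (0x5F); 'A'..'Z' fall inside it.
range_class = re.compile(r"[:-_]")
# With '-' placed last, the class holds just ':', '_' and a literal '-'.
literal_class = re.compile(r"[:_-]")

range_matches_upper = bool(range_class.fullmatch("A"))       # True
range_matches_hyphen = bool(range_class.fullmatch("-"))      # False, '-' is 0x2D
literal_matches_upper = bool(literal_class.fullmatch("A"))   # False
literal_matches_hyphen = bool(literal_class.fullmatch("-"))  # True

print(range_matches_upper, range_matches_hyphen)
print(literal_matches_upper, literal_matches_hyphen)
```

So the original class never matched a hyphen at all, while it did match uppercase letters and several other symbols.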

So, use:

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
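
As a sanity check, the corrected pattern can be run against the text from the question. The sketch below calls re.findall directly, on the assumption (suggested above, not verified against nltk's source) that regexp_tokenize delegates to it. With nltk installed, nltk.regexp_tokenize(text, pattern) from the question would be the call to use instead.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

# No capturing groups remain, so findall returns the whole matches.
tokens = re.findall(pattern, text)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```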
