nltk regular expression tokenizer

Views: 163 | Published: 2020/5/18 1:12:22 | Tags: python regex pattern-matching nltk

Problem description

I tried to implement a regular expression tokenizer with nltk in Python, but this is the result:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]

But the desired result is this:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where is the mistake?

Recommended answer

You should turn all capturing groups into non-capturing groups:

([A-Z]\.)+        ->  (?:[A-Z]\.)+
\w+(-\w+)*        ->  \w+(?:-\w+)*
\$?\d+(\.\d+)?%?  ->  \$?\d+(?:\.\d+)?%?

The issue is that regexp_tokenize seems to use re.findall, which returns a list of tuples of the captured groups, rather than the full matches, when the pattern defines more than one capturing group. See the nltk.tokenize package reference:

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)
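The re.findall behaviour described above can be illustrated with the standard library alone. This is a minimal sketch on a shortened sample string, assuming plain re.findall stands in for whatever regexp_tokenize does internally; it is not meant to reproduce the exact NLTK output from the question.

```python
import re

text = "poster-print costs $12.40"

# Two capturing groups: findall returns one tuple of group contents per match,
# not the matched text itself.
capturing = r"\w+(-\w+)*|\$?\d+(\.\d+)?%?"

# Same pattern with non-capturing groups: findall returns the full matches.
non_capturing = r"\w+(?:-\w+)*|\$?\d+(?:\.\d+)?%?"

tuples = re.findall(capturing, text)
tokens = re.findall(non_capturing, text)

print(tuples)  # [('-print', ''), ('', ''), ('', '.40')]
print(tokens)  # ['poster-print', 'costs', '$12.40']
```

Each tuple holds only what the groups captured last, which is why most entries are empty strings.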

Also, I am not sure you meant to write :-_ inside the character class: there the hyphen defines a range from : to _, which includes all uppercase letters. To match a literal hyphen, put the - at the end of the character class.
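The effect of the hyphen position can be checked directly with the re module. The sketch below compares the two character classes from the patterns; the variable names are only illustrative.

```python
import re

# ':-_' is a range from ':' (U+003A) to '_' (U+005F); it covers
# ';', '?', '@', 'A'-'Z', '[', ']', '^' and '_', but not '-' (U+002D).
range_class = re.compile(r"""[][.,;"'?():-_`]""")

# With the hyphen placed last, it is a literal '-'.
literal_class = re.compile(r"""[][.,;"'?():_`-]""")

range_matches_upper = bool(range_class.fullmatch("T"))
range_matches_hyphen = bool(range_class.fullmatch("-"))
literal_matches_upper = bool(literal_class.fullmatch("T"))
literal_matches_hyphen = bool(literal_class.fullmatch("-"))

print(range_matches_upper, range_matches_hyphen)      # True False
print(literal_matches_upper, literal_matches_hyphen)  # False True
```

So the original class accepts uppercase letters as punctuation tokens and never matches a hyphen on its own.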

So, use:

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
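As a quick check, the corrected pattern can be run against the sample sentence from the question. This sketch uses re.findall from the standard library as a stand-in for the tokenizer's matching, which is an assumption; it does not import nltk.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

# With no capturing groups left, findall returns the full matches.
tokens = re.findall(pattern, text)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

With NLTK installed, the call from the question is nltk.regexp_tokenize(text, pattern); its exact behaviour may differ between NLTK versions.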
