NLTK regexp tokenizer not playing nice with decimal point in regex

Question

I'm trying to write a text normalizer, and one of the basic cases it needs to handle is turning something like 3.14 into "three point one four" or "three point fourteen".

I'm currently using the pattern \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle plain numbers as well as currency and percentages. However, while something like $23.50 is handled perfectly (it parses to ['$23.50']), 3.14 parses to ['3', '14']: the decimal point is dropped.

I've tried adding a separate pattern, \d+.\d+, to my regexp, but that didn't help (and shouldn't my current pattern match that already?).

Edit 2: I also just discovered that the % part doesn't seem to work either: 20% returns just ['20']. I feel like there must be something wrong with my regexp, but I've tested it in Pythex and it seems fine.

Edit: Here's my code.

import nltk
import re

pattern = r'''(?x)    # set flag to allow verbose regexps
            ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
            | \w+([-']\w+)*        # words w/ optional internal hyphens/apostrophe
            | \$?\d+(\.\d+)?%?  # numbers, incl. currency and percentages
            | [+/\-@&*]         # special characters with meanings
            '''
# `line` is one of the test strings below
words = nltk.regexp_tokenize(line, pattern)
words = [w.lower() for w in words]
print(words)

Here are some of my test strings:

32188
2598473
26 letters from A to Z
3.14 is pi.                         <-- ['3', '14', 'is', 'pi']
My weight is about 68 kg, +/- 10 grams.
Good muffins cost $3.88 in New York <-- ['good', 'muffins', 'cost', '$3.88', 'in', 'new', 'york']

Recommended answer

The culprit is:

\w+([-']\w+)*

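\w matches digits as well as letters, and since this alternative has no way to match a ., it consumes only the 3 in 3.14. Alternatives are tried left to right and the first one that matches wins, so the number alternative never gets a chance; the same happens with 20%, where \w+ takes 20 and the % is left unmatched. $23.50 works only because \w cannot match $, so the earlier alternatives fail at that position. Move the options around so that \$?\d+(\.\d+)?%? comes before the part above, and the number format is attempted first: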

(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]

regex101 demo
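
The effect of the ordering can be checked with plain `re`, without NLTK. The sketch below copies the two one-line patterns from this thread and collects the full text of each match. The `tokens` helper and the names `ORIGINAL` and `REORDERED` are made up for illustration, and `re.finditer` only stands in for the tokenizer, so treat this as an approximation of what NLTK does rather than its actual implementation.

```python
import re

# Patterns copied from the question and the answer; only the order of the
# second and third alternatives differs.
ORIGINAL = r"(?x)([A-Z]\.)+|\w+([-']\w+)*|\$?\d+(\.\d+)?%?|[+/\-@&*]"
REORDERED = r"(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]"


def tokens(pattern, text):
    """Return the full text of each match (group 0), ignoring sub-groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]


for text in ["3.14 is pi.", "20%", "$23.50"]:
    # Print the input, then the tokens under each ordering.
    print(repr(text), tokens(ORIGINAL, text), tokens(REORDERED, text))
```

One side effect of putting the number alternative first: a token such as 3rd is now split into 3 and rd, because the number alternative wins at the digit. Whether that matters depends on the text being normalized.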

The same pattern in expanded (verbose) form:

pattern = r'''(?x)               # set flag to allow verbose regexps
              ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
              | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
              | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
              | [+/\-@&*]        # special characters with meanings
            '''
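
For running this on a current setup, note that the NLTK documentation for RegexpTokenizer asks for patterns without capturing parentheses (use `(?:...)` instead), and Python's own `re.findall` returns group contents rather than whole matches when a pattern contains capturing groups. Below is a sketch of the reordered pattern with non-capturing groups, using plain `re.findall` as a stand-in for `nltk.regexp_tokenize`. The `tokenize` helper and the non-capturing rewrite are assumptions of this example, not part of the original answer.

```python
import re

# The reordered pattern from the answer, with every group made non-capturing
# so that re.findall returns whole tokens. This rewrite is an assumption of
# the sketch, not something the original answer did.
PATTERN = r'''(?x)              # verbose mode: whitespace and comments ignored
      (?:[A-Z]\.)+              # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?        # numbers, incl. currency and percentages
    | \w+(?:[-']\w+)*           # words w/ optional internal hyphens/apostrophe
    | [+/\-@&*]                 # special characters with meanings
'''


def tokenize(line):
    """Return the lower-cased tokens of one line (hypothetical helper)."""
    return [tok.lower() for tok in re.findall(PATTERN, line)]


test_lines = [
    "3.14 is pi.",
    "20%",
    "My weight is about 68 kg, +/- 10 grams.",
    "Good muffins cost $3.88 in New York",
]
for line in test_lines:
    print(tokenize(line))
```

If the same string is passed to `nltk.regexp_tokenize`, the results should match, but that has not been verified here against a specific NLTK version.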
