Python regex: tokenizing English contractions

Question

I am trying to parse strings so as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"].

However, the nltk module does not seem to be up to the task, as:

"I wouldn't've done that."

is tokenized as:

['I', "wouldn't", "'ve", 'done', 'that', '.']

whereas the desired tokenization of "wouldn't've" is: ['would', "n't", "'ve"]

After examining common English contractions, I tried to write a regex to do the job, but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:

n't, 've, 'd, 'll, 's, 'm, 're

But the token "'ve" can also follow other contractions such as:

'd've, n't've, and (conceivably) 'll've

At the moment, I am trying to wrangle this regex:

\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b

However, this pattern also matches the badly formed:

"wouldn't've've"

It seems the problem is that the third apostrophe qualifies as a word boundary, so the final "'ve" token on its own matches the whole regex.

I have been unable to think of a way to differentiate a word boundary from an apostrophe. Failing that, I am open to advice on alternative strategies.

Also, I am curious whether there is any way to include the word boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace, and there doesn't seem to be a way around this.
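For what it's worth, the backspace behaviour is easy to confirm, and a negative lookahead is one possible stand-in for "a boundary that is not an apostrophe". This is only a sketch: the names `backspace_only` and `suffix_end` and the test string are made up for illustration and are not part of the question.

```python
import re

# Inside a character class, \b means the backspace character (0x08),
# not a word boundary.
backspace_only = re.compile(r"[\b]")
assert backspace_only.fullmatch("\x08") is not None
assert backspace_only.search("a b") is None

# Assumed substitute for "\b but not before an apostrophe": require that
# the next character is neither a word character nor an apostrophe.
suffix_end = re.compile(r"'ve(?![\w'])")

# Only the last "'ve" qualifies; the first one is followed by an apostrophe.
positions = [m.start() for m in suffix_end.finditer("wouldn't've've done")]
print(positions)  # [11]
```

Because the lookahead is zero-width, it can sit wherever `\b` would have gone without consuming any text.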

This is the output:

>>> import re
>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've've done that.")
>>> print(matches)
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]

I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.
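One way to see where a match like that comes from is to print the full text of each match with `finditer`. The top-level `|` has the lowest precedence, so the pattern reads as two separate alternatives, and the second one, `('s|'m|'re|'ve)\b`, needs no leading letters at all. The input below is a hypothetical string with a doubled "'ve", chosen only to illustrate this.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")

# Hypothetical input containing a doubled "'ve".
text = "hadn't've've"

# group(0) is the entire text consumed by each match.
spans = [(m.group(0), m.span()) for m in pattern.finditer(text)]
print(spans)  # [("hadn't've", (0, 9)), ("'ve", (9, 12))]
```

The second match is just "'ve": nothing matched `[a-zA-Z]+`, because that part belongs to the first alternative only.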

Recommended answer

(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])

\2 is the match, \3 is the first group of your original pattern, \4 the second, and \5 the third.
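As a rough sketch of how those groups could be turned into tokens (the helper `split_contraction` and the constant `CONTRACTION` are hypothetical names, not part of the answer), the stem can be recovered by trimming the captured suffixes off group 2:

```python
import re

# The regex from the answer, unchanged.
CONTRACTION = re.compile(
    r"""(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])"""
)

def split_contraction(word):
    """Return [stem, suffix, ...], or [word] if the regex does not match."""
    m = CONTRACTION.fullmatch(word)
    if m is None:
        return [word]
    # Groups 3, 4 and 5 hold the suffix tokens; unmatched groups are None.
    suffixes = [g for g in m.group(3, 4, 5) if g]
    # Group 2 is stem plus suffixes, so cut the suffixes off the end.
    stem_len = len(m.group(2)) - sum(len(s) for s in suffixes)
    return [m.group(2)[:stem_len]] + suffixes

print(split_contraction("wouldn't've"))     # ['would', "n't", "'ve"]
print(split_contraction("she'll"))          # ['she', "'ll"]
print(split_contraction("wouldn't've've"))  # ["wouldn't've've"]
```

The badly formed "wouldn't've've" is rejected because the trailing `(?!['"\w])` refuses to let a match end just before another apostrophe.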
