Python regex: tokenizing English contractions

Views: 31 Published: 2022/1/2 17:35:07 Tags: python regex pattern-matching nlp

Problem description

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"].

However, the nltk module does not seem to be up to the task, since:

"I wouldn't've done that."

is tokenized as:

['I', "wouldn't", "'ve", 'done', 'that', '.']

where the desired tokenization of "wouldn't've" was: ['would', "n't", "'ve"]

After examining common English contractions, I am trying to write a regex to do the job but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:

n't, 've, 'd, 'll, 's, 'm, 're

But the token "'ve" can also follow other contractions such as:

'd've, n't've, and (conceivably) 'll've

At the moment, I am trying to wrangle this regex:

[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)

However, this pattern also matches the badly formed:

"wouldn't've've"

It seems the problem is that the third apostrophe qualifies as a word boundary so that the final "'ve" token matches the whole regex.
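One way to check this diagnosis is to look at exactly which substring each match covers. The sketch below runs the pattern as shown above over a malformed word with a doubled "'ve" (the sample string is an assumption chosen for illustration). In Python's `re` syntax, `|` binds more loosely than concatenation, so the pattern appears to be parsed as two independent branches, and the second branch does not require any leading letters.

```python
import re

# Pattern exactly as written in the question.
pattern = re.compile(r"[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)")

# Assumed malformed input with a doubled "'ve".
sample = "wouldn't've've"

# Collect the full text covered by each successive match.
pieces = [m.group(0) for m in pattern.finditer(sample)]
print(pieces)

# The right-hand branch matches a bare clitic with no letters in front of it.
bare_clitic_matches = pattern.fullmatch("'ve") is not None
print(bare_clitic_matches)
```

If this reading is right, the trailing "'ve" is not matched because of a word boundary at all: it is simply a complete match for the second alternative on its own.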

I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to advice for alternative strategies.
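As one possible alternative strategy, each chunk of letters and apostrophes can be validated as a whole with `fullmatch` instead of being searched for suffixes. The sketch below is only an illustration under the rules stated in the question: `WORD`, `split_contraction`, and `tokenize` are hypothetical names, and the suffix lists are the ones quoted above, with "'ve" allowed at most once after "n't", "'d", or "'ll".

```python
import re

# Hypothetical pattern: a letter-only stem, then at most one contraction chain.
WORD = re.compile(
    r"(?P<stem>[a-zA-Z]+?)"
    r"(?:(?P<first>n't|'d|'ll)(?P<second>'ve)?"   # suffixes that may take "'ve"
    r"|(?P<only>'s|'m|'re|'ve))?"                 # suffixes that end the word
)

def split_contraction(word):
    """Return the components of a well-formed word, or None if malformed."""
    m = WORD.fullmatch(word)
    if m is None:
        return None
    return [g for g in m.group("stem", "first", "second", "only") if g]

def tokenize(text):
    """Split text into words and punctuation, then split contractions."""
    tokens = []
    for chunk in re.findall(r"[a-zA-Z']+|[^\sa-zA-Z']", text):
        parts = split_contraction(chunk)
        # Malformed chunks and punctuation are kept as single tokens.
        tokens.extend(parts if parts else [chunk])
    return tokens

print(tokenize("I wouldn't've done that."))
print(split_contraction("wouldn't've've"))
```

Because `fullmatch` must consume the entire chunk, a second "'ve" has nothing left to match against, so the malformed form is rejected rather than partially tokenized.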

Also, I am curious whether there is any way to include the word boundary special character \b in a character class. According to the Python documentation, \b inside a character class matches a backspace, and there doesn't seem to be a way around this.
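The behaviour described here can be checked directly. The sketch below shows `[\b]` matching a backspace character, shows that `\b` does treat an apostrophe as a boundary, and then uses lookarounds as a stand-in for a "boundary that also counts apostrophes as word characters". The `custom` pattern is an assumption for illustration, not something from the Python documentation.

```python
import re

# Inside a character class, \b is the backspace character (\x08).
backspaces = re.findall(r"[\b]", "a\bb")
print(backspaces)

# Outside a class, \b sees a boundary between the apostrophe and "v".
after_apostrophe = re.findall(r"\bve\b", "'ve")
print(after_apostrophe)

# Hypothetical custom boundary: letters not adjacent to a letter or apostrophe.
custom = re.compile(r"(?<![a-zA-Z'])[a-zA-Z]+(?![a-zA-Z'])")
plain_words = custom.findall("I wouldn't've done that")
print(plain_words)
```

Negative lookbehind and lookahead are zero-width, like `\b`, but they accept an arbitrary character class, which is the closest equivalent to putting a boundary inside `[...]`.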

Here is the output:

>>> import re
>>> pattern = re.compile(r"[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)")
>>> matches = pattern.findall("She'll wish she hadn't've've done that.")
>>> print(matches)
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]

I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.
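A possible explanation is that nothing matches `[a-zA-Z]+` in the third result, because the top-level `|` makes the letter prefix part of the first branch only. The sketch below compares the pattern as written with a regrouped variant in which the letters are required before either suffix set. The regrouped pattern is an assumption about the intended meaning, and the sample sentence uses a doubled "'ve" so that a stray clitic is present.

```python
import re

# Pattern as written: "|" splits it into two standalone branches.
original = re.compile(r"[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)")

# Regrouped variant (assumed intent): the alternation sits inside one group,
# so the leading letters are required for both suffix sets.
regrouped = re.compile(r"[a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve))")

text = "She'll wish she hadn't've've done that."

original_matches = original.findall(text)
regrouped_matches = regrouped.findall(text)
print(original_matches)
print(regrouped_matches)
```

The regrouped variant no longer reports the bare "'ve", though it still matches the well-formed prefix of the malformed word, so rejecting such words outright would need an extra check on what follows the match.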
