Regex to split words in Python


Question

I am designing a regex to split all the actual words out of a given text:


Example input:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"


Expected output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]



I came up with a regex like this:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

After splitting with it in Python, the result contains None items and strings of spaces.

How do I get rid of the None items? And why weren't the spaces matched?
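As a minimal sketch of where the None items come from: `re.split` also returns the text of every capturing group in the pattern, and a group that did not take part in a match is reported as None. Rewriting the groups as non-capturing `(?:...)` removes those extra entries. The short `sample` string and the variable names are illustrative assumptions; this only shows the mechanism and does not claim the pattern then yields the expected output for the full sentence.

```python
import re

# The pattern from the question: four capturing groups in total.
capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
# The same alternatives, written with non-capturing groups.
non_capturing = r"(?:[^a-zA-Z]+'|'[^a-zA-Z]+)|[^a-zA-Z']+"

sample = "mom went"  # hypothetical short input

with_groups = re.split(capturing, sample)
without_groups = re.split(non_capturing, sample)

print(with_groups)     # ['mom', None, None, None, ' ', 'went']
print(without_groups)  # ['mom', 'went']
```

The single space shows up in the first result because it was matched by the fourth group, and captured text is inserted into the split output.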



Splitting on spaces gives items like: ["there."]
Splitting on non-letters gives items like: ["John", "s"]
Splitting on non-letters except ' gives items like: ["'Where", "you'"]
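The three naive approaches can be reproduced with a short sketch. The variable names are assumptions made for illustration; the input is the example sentence from the question.

```python
import re

text = "John's mom went there, but he wasn't there. So she said: 'Where are you'"

# 1. Split on whitespace: punctuation stays glued to the words.
on_spaces = text.split()
# 2. Split on runs of non-letters: apostrophes cut words in two.
on_non_letters = re.split(r"[^a-zA-Z]+", text)
# 3. Split on runs of non-letters except ': quote marks stay attached.
keep_apostrophe = re.split(r"[^a-zA-Z']+", text)

print("there." in on_spaces)       # True
print(on_non_letters[:2])          # ['John', 's']
print("'Where" in keep_apostrophe, "you'" in keep_apostrophe)  # True True
```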

Recommended answer

Instead of a regex, you can use string functions:

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

However, in your example you want to keep the apostrophe in John's but remove it in you!!'. Plain string operations fail at that point, and you need a more finely tuned regex.
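A small sketch of that limitation, using a hypothetical helper `strip_and_split` that wraps the replace loop above: leaving the apostrophe out of the removal set keeps the stray quote marks, while adding it also damages John's and wasn't.

```python
def strip_and_split(text, to_be_removed):
    """Delete every character in to_be_removed, then split on whitespace."""
    for c in to_be_removed:
        text = text.replace(c, "")
    return text.split()

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

kept = strip_and_split(s, ".,:!")      # apostrophes untouched
dropped = strip_and_split(s, ".,:!'")  # apostrophes removed everywhere

print(kept[0], kept[-1])        # John's you'
print(dropped[0], dropped[-1])  # Johns you
```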

A simple regex can probably solve your problem:

(\w[\w']*)

It matches anything that starts with a word character (\w, which covers letters, digits, and the underscore) and keeps matching while the next character is a word character or an apostrophe.
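A quick sketch of how this first pattern behaves with `re.findall`. The sample strings are made up for illustration; note that a leading quote is skipped, a trailing apostrophe is kept, and `\w` also accepts digits and underscores.

```python
import re

first = re.compile(r"\w[\w']*")

quoted = first.findall("'Where are you'")
mixed = first.findall("v2_final isn't 42")

print(quoted)  # ['Where', 'are', "you'"]
print(mixed)   # ['v2_final', "isn't", '42']
```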

(\w[\w']*\w)

This second regex is for a very specific situation. The first regex can capture words like you'. This one avoids that and only captures an apostrophe when it is inside the word (not at the beginning or the end). The trade-off is that the second regex can no longer capture the trailing apostrophe in Moss' mom. You must decide whether to capture the trailing apostrophe in possessive names that end in s.
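The trade-off can be seen by running both patterns side by side. This is only an illustrative sketch; the names `first` and `second` are labels chosen here, not part of the original answer.

```python
import re

first = re.compile(r"\w[\w']*")      # may end with an apostrophe
second = re.compile(r"\w[\w']*\w")   # must end with a word character

quote_first = first.findall("Where are you'")
quote_second = second.findall("Where are you'")
moss_first = first.findall("Moss' mom")
moss_second = second.findall("Moss' mom")

print(quote_first, quote_second)  # ['Where', 'are', "you'"] ['Where', 'are', 'you']
print(moss_first, moss_second)    # ["Moss'", 'mom'] ['Moss', 'mom']
```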

Example:

import re

# Raw string so the backslashes reach the regex engine unchanged.
rgx = re.compile(r"(\w[\w']*\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

UPDATE 2: I found a bug in my regex! It cannot capture single-letter words, including one followed by an apostrophe such as A'. Here is the fixed regex:

(\w[\w']*\w|\w)

rgx = re.compile(r"(\w[\w']*\w|\w)")
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
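To tie this back to the question, here is a minimal sketch that wraps the final pattern in a helper and checks it against the expected output given by the asker. `WORD_RE` and `split_words` are hypothetical names introduced for illustration, and the "I am a dev" string is a made-up input showing the single-letter fix.

```python
import re

# Final pattern from the answer: two or more characters ending in a word
# character, or else a single word character.
WORD_RE = re.compile(r"\w[\w']*\w|\w")

def split_words(text):
    """Return the words in text, keeping apostrophes only inside words."""
    return WORD_RE.findall(text)

question_input = "John's mom went there, but he wasn't there. So she said: 'Where are you'"
expected = ["John's", "mom", "went", "there", "but", "he", "wasn't",
            "there", "So", "she", "said", "Where", "are", "you"]

print(split_words(question_input) == expected)  # True

# The earlier pattern silently drops one-letter words; the fixed one keeps them.
before_fix = re.findall(r"\w[\w']*\w", "I am a dev")
after_fix = split_words("I am a dev")
print(before_fix)  # ['am', 'dev']
print(after_fix)   # ['I', 'am', 'a', 'dev']
```

As noted earlier in the answer, this pattern still drops a trailing possessive apostrophe such as the one in Moss'.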
