Regex to remove words from a list that are not A-Z a-z (exceptions)

Question

I would like to remove non-alphabetic characters from a string and turn each word into a list element, such that:

"All, the above." -> ["all", "the", "above"]

The following seems to work:

re.split(r'\W+', text)

but it does not account for corner cases.

For example:

"The U.S. is where it's nice." -> ["the", "U", "S", "is", "where", "it", "s", "nice"]

I want the sentence-ending period removed, but neither the apostrophe nor the periods in "U.S."
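One possible way to express this rule is to match the tokens worth keeping instead of splitting on everything else. The sketch below is only an illustration and is not from the original post: the pattern and the helper name `extract_words` are assumptions, and it treats any run of two or more "letter plus period" pairs as an acronym.

```python
import re

# Hypothetical pattern (an assumption, not from the original post):
#   1. an acronym: two or more "letter + period" pairs, e.g. "U.S." or "J.C."
#   2. otherwise a plain word, optionally with internal apostrophes, e.g. "it's"
WORD_RE = re.compile(r"(?:[A-Za-z]\.){2,}|[A-Za-z]+(?:'[A-Za-z]+)*")


def extract_words(text):
    """Return the acronyms and words found in text, skipping other punctuation."""
    return WORD_RE.findall(text)


print(extract_words("The U.S. is where it's nice."))
# ['The', 'U.S.', 'is', 'where', "it's", 'nice']
```

This does not resolve the ambiguity of a sentence that ends in an acronym (for example "I live in the U.S."): the final period is kept as part of the acronym.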

My idea is to create a regex that splits on spaces and then strips the extra punctuation:

"I, live at home." -> ["I", "live", "at", "home"] (comma and period removed)
"I J.C. live at home." -> ["I", "J.C.", "live", "at", "home"] (acronym periods not removed but end of sentence period removed)

What I'm trying to do becomes considerably harder for sentences like:

"The flying saucer (which was green)." -> ["...", "green"] (ignore ").") 
"I J.C., live at home." -> ["I", "J.C.", "..."] (ignore punctuation)
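The split-then-strip idea described above could be sketched roughly as follows. This is a hedged illustration rather than a complete solution: the helper name `split_and_strip` and the acronym rule (a token that starts with two or more "letter plus period" pairs and is followed only by punctuation) are assumptions made for this example.

```python
import re
import string

# Assumption for this sketch: two or more "letter + period" pairs form an acronym.
ACRONYM_RE = re.compile(r"(?:[A-Za-z]\.){2,}")


def split_and_strip(text):
    """Split on whitespace, then trim punctuation from both ends of each token."""
    words = []
    for token in text.split():
        # Drop leading punctuation such as "(" or a quote mark.
        token = token.lstrip(string.punctuation)
        match = ACRONYM_RE.match(token)
        # Keep the acronym with its periods if only punctuation follows it,
        # e.g. "J.C.," -> "J.C."
        if match and not token[match.end():].strip(string.punctuation):
            words.append(match.group())
            continue
        # Otherwise drop trailing punctuation; internal apostrophes survive.
        token = token.rstrip(string.punctuation)
        if token:
            words.append(token)
    return words


print(split_and_strip("The flying saucer (which was green)."))
# ['The', 'flying', 'saucer', 'which', 'was', 'green']
print(split_and_strip("I J.C., live at home."))
# ['I', 'J.C.', 'live', 'at', 'home']
```

Because only the ends of each token are trimmed, apostrophes inside a word such as "it's" are left alone.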

Special case (strings are retrieved from a raw text file):

"I love you.<br /> Come home soon!" -> ["..."] (ignore the <br /> tag and punctuation)
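For this case, one option is to replace the line-break tags with spaces before extracting words, since a plain `\W+` split would otherwise return "br" as a word. The snippet below is a minimal sketch under that assumption; the names `BR_TAG_RE` and `strip_line_breaks` are hypothetical, and no other HTML tag is handled.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case. Other tags are not handled.
BR_TAG_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)


def strip_line_breaks(text):
    """Replace each HTML line-break tag with a single space."""
    return BR_TAG_RE.sub(" ", text)


cleaned = strip_line_breaks("I love you.<br /> Come home soon!")
# Simple word pattern for illustration: letters with optional internal apostrophes.
print(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", cleaned))
# ['I', 'love', 'you', 'Come', 'home', 'soon']
```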

I am relatively new to Python, and writing regexes confuses me, so any help with parsing strings this way would be much appreciated! If there is a catch-22 here and not everything I am trying to accomplish is possible, let me know.

Recommended answer

Although I understand you are asking specifically about regex, another solution to your overall problem is to use a library built for this express purpose, for instance nltk. It should help you split your strings in sane ways (parsing the punctuation out into separate items in a list), which you can then filter.

You are right: the number of corner cases is huge precisely because human language is imprecise and ambiguous. Using a library that already accounts for these edge cases should save you a lot of headaches.

The NLTK book includes a helpful primer on processing raw text. The most useful function for your use case seems to be nltk.word_tokenize, which returns a list of strings with words and punctuation separated.
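The tokenize-then-filter approach might look like the sketch below. The helper names are hypothetical. `nltk.word_tokenize` needs nltk installed along with its tokenizer data (fetched through `nltk.download`), so the import is kept inside its own function and the filter is demonstrated on a hand-written token list.

```python
def tokenize_with_nltk(text):
    """Tokenize with nltk.word_tokenize (requires nltk and its tokenizer data)."""
    import nltk  # imported lazily so the filter below runs without nltk
    return nltk.word_tokenize(text)


def drop_punctuation_tokens(tokens):
    """Keep only the tokens that contain at least one letter."""
    return [tok for tok in tokens if any(ch.isalpha() for ch in tok)]


# Hand-written token list for illustration; real tokenizer output may differ.
sample_tokens = ["I", "J.C.", ",", "live", "at", "home", "."]
print(drop_punctuation_tokens(sample_tokens))
# ['I', 'J.C.', 'live', 'at', 'home']
```

Treebank-style tokenizers usually split contractions such as "it's" into two tokens, so a further step would be needed if contractions must stay whole.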
