regex to remove words from a list that are not A-Z a-z (exceptions)
Question
I would like to remove non-alpha characters from a string and convert each word into a list component such that:
"All, the above." -> ["all", "the", "above"]
It seems that the following works:
re.split(r'\W+', str)
but it does not account for corner cases.
For example:
"The U.S. is where it's nice." -> ["the", "U", "S", "is", "where", "it", "s", "nice"]
I want the sentence-ending period removed, but neither the apostrophe nor the periods in "U.S."
My idea is to split on whitespace and then strip the extra punctuation from each piece:
"I, live at home." -> ["I", "live", "at", "home"] (comma and period removed)
"I J.C. live at home." -> ["I", "J.C.", "live", "at", "home"] (acronym periods not removed but end of sentence period removed)
What I'm trying to do becomes considerably harder for sentences like:
"The flying saucer (which was green)." -> ["...", "green"] (ignore ").")
"I J.C., live at home." -> ["I", "J.C.", "..."] (ignore punctuation)
Special case (strings are retrieved from raw text file):
"I love you.<br /> Come home soon!" -> ["..."] (ignore breakpoint and punctuation)
I am relatively new to Python and find writing regexes confusing, so any help on how to parse strings in this way would be much appreciated. If there is a catch-22 here and not everything I am trying to accomplish is possible, let me know.
Recommended answer
Although I understand you are asking specifically about regex, another solution to your overall problem is to use a library built for this express purpose, for instance nltk. It should help you split your strings in sane ways (parsing out the proper punctuation into separate items in a list), which you can then filter from there.
You are right, the number of corner cases is huge precisely because human language is imprecise and vague. Using a library that already accounts for these edge cases should save you a lot of headache.
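For the specific sentences in the question, a plain regex can get fairly far before the corner cases start to pile up. The following is only a minimal sketch, not a general solution: the names `WORD_RE`, `TAG_RE` and `extract_words` are made up for illustration, and the pattern assumes ASCII letters, acronyms written as single letters followed by periods, and simple tags like `<br />`.

```python
import re

# Hypothetical pattern (an assumption, not part of the original answer):
#   (?:[A-Za-z]\.){2,}         dotted acronyms such as "U.S." or "J.C."
#   [A-Za-z]+(?:'[A-Za-z]+)*   plain words, optionally with inner apostrophes
WORD_RE = re.compile(r"(?:[A-Za-z]\.){2,}|[A-Za-z]+(?:'[A-Za-z]+)*")

# Very rough tag matcher, good enough for "<br />" but not for real HTML.
TAG_RE = re.compile(r"<[^>]+>")


def extract_words(text):
    """Return the words in text, dropping tags and surrounding punctuation."""
    text = TAG_RE.sub(" ", text)   # replace tags with a space first
    return WORD_RE.findall(text)   # then pick out acronyms and words


print(extract_words("The U.S. is where it's nice."))
# ['The', 'U.S.', 'is', 'where', "it's", 'nice']
print(extract_words("I J.C., live at home."))
# ['I', 'J.C.', 'live', 'at', 'home']
print(extract_words("I love you.<br /> Come home soon!"))
# ['I', 'love', 'you', 'Come', 'home', 'soon']
```

Matching the words you want with `re.findall` tends to be easier to reason about than splitting on what you do not want. The sketch still has blind spots: digits and non-ASCII letters are dropped, and a sentence that ends in an acronym keeps its final period because the two cases cannot be told apart by pattern alone.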
A helpful primer on dealing with raw text in nltk is here. It seems the most useful function for your use case is nltk.word_tokenize, which passes back a list of strings with words and punctuation separated.
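As a rough illustration of "tokenize, then filter", here is a small sketch. The helper names `keep_words` and `tokenize_words` are hypothetical, and it assumes nltk is installed along with its tokenizer data (downloaded via `nltk.download`, under a resource name that depends on the nltk version). The exact tokens returned depend on the tokenizer, so no output is shown for the nltk call itself.

```python
def keep_words(tokens):
    """Drop tokens made up only of punctuation, e.g. ',' '.' '(' ')'."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]


def tokenize_words(text):
    """Tokenize text with nltk and keep only word-like tokens."""
    # Imported here so keep_words can be tried without nltk installed.
    import nltk  # assumes `pip install nltk` plus the tokenizer data
    return keep_words(nltk.word_tokenize(text))


# The filter on its own, applied to a hand-written token list:
print(keep_words(["I", ",", "live", "at", "home", "."]))
# ['I', 'live', 'at', 'home']
```

One thing to watch for: nltk's default word tokenizer splits contractions such as "it's" into two tokens ("it" and "'s"), so if you want the contraction kept as one word you would need to rejoin those pieces or pick a different tokenizer from `nltk.tokenize`.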