Extracting words from a string, removing punctuation and returning a list with separated words
Question
I was wondering how to implement a function get_words() that returns the words in a string as a list, stripping away the punctuation.
The way I would like to implement it is to replace every character that is not in string.ascii_letters with '' and return the result of .split().
def get_words(text):
    '''The function should take one argument which is a string'''
    return text.split()
For example:
>>> get_words('Hello world, my name is...James!')
returns:
['Hello', 'world', 'my', 'name', 'is', 'James']
Recommended answer
This has nothing to do with splitting and punctuation; you just care about the letters (and numbers), and just want a regular expression:
import re

def get_words(text):
    # Raw string so that \w reaches the regex engine unchanged
    return re.compile(r'\w+').findall(text)
Demo:
>>> re.compile(r'\w+').findall('Hello world, my name is...James the 2nd!')
['Hello', 'world', 'my', 'name', 'is', 'James', 'the', '2nd']
If you don't care about numbers, replace \w with [A-Za-z] for just letters, or with [A-Za-z'] to include contractions, etc. There are probably fancier ways to include alphabetic, non-numeric character classes (e.g. letters with accents) with other regexes.
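As a rough sketch of those variations (the function names here are made up for illustration, and the [^\W\d_] class is just one possible way to get "letters only, accents included" in Python 3, not necessarily the fancier way alluded to above):

```python
import re

def words_letters_only(text):
    # ASCII letters only: digits are dropped, so '2nd' becomes 'nd'
    return re.findall(r"[A-Za-z]+", text)

def words_with_contractions(text):
    # ASCII letters plus the apostrophe, so "Don't" stays in one piece
    return re.findall(r"[A-Za-z']+", text)

def words_unicode_letters(text):
    # Assumption: "word characters that are neither digits nor underscore"
    # is an acceptable definition of a letter, accented letters included
    return re.findall(r"[^\W\d_]+", text)
```

For instance, words_letters_only('James the 2nd!') gives ['James', 'the', 'nd'], which may or may not be what you want for tokens such as '2nd'.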
I almost answered this question here: Split Strings with Multiple Delimiters?
But your question is actually under-specified: do you want 'this is: an example' to be split into:
- ['this', 'is', 'an', 'example']
- or ['this', 'is', 'an', '', 'example']?
I assumed it was the first case.
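To make the difference concrete, here is a small sketch comparing the two behaviours. It assumes that "splitting" means splitting on every single space or colon; with that rule the empty string shows up wherever two delimiters sit next to each other, so its exact position depends on the delimiters you pick:

```python
import re

text = 'this is: an example'

# Splitting on each single delimiter character keeps the empty string
# that sits between the adjacent ':' and ' '
with_empties = re.split(r'[ :]', text)

# Extracting runs of word characters never produces empty strings
without_empties = re.findall(r'\w+', text)

print(with_empties)
print(without_empties)
```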
['this', 'is', 'an', 'example'] is what I want. Is there a method without importing regex? If we can just replace the non-ascii_letters with '', then split the string into words in a list, would that work? – James Smith 2 mins ago
The regexp is the most elegant, but yes, you could do this as follows:
def get_words(text):
    """
    Returns a list of words, where a word is defined as a
    maximally connected substring of alphanumeric characters,
    as defined by "a".isalnum()

    >>> get_words('Hello world, my name is... Élise!')  # works in python3
    ['Hello', 'world', 'my', 'name', 'is', 'Élise']
    """
    return ''.join((c if c.isalnum() else ' ') for c in text).split()
Or use .isalpha() instead of .isalnum() if you want letters only.
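If you specifically want string.ascii_letters, as in your comment, something along these lines should work. This is only a sketch, and both function names are made up for illustration. Note that the non-letters have to become ' ' rather than '': replacing them with '' also removes the spaces, so all the words get glued together before .split() ever sees them.

```python
import string

LETTERS = set(string.ascii_letters)

def get_words_ascii(text):
    # Turn every non-ASCII-letter into a space, then split on whitespace
    return ''.join(c if c in LETTERS else ' ' for c in text).split()

def get_words_glued(text):
    # What "replace non-letters with ''" does: the spaces vanish too,
    # so only one long "word" is left
    return ''.join(c for c in text if c in LETTERS).split()
```

Keep in mind that with string.ascii_letters an accented name such as 'Élise' loses its first letter, since 'É' is not an ASCII letter.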
Sidenote: You could also do the following, though it requires importing another standard-library module:
from itertools import groupby

# groupby is generally always overkill and makes for unreadable code
# ... but is fun
def get_words(text):
    return [
        ''.join(chars)
        for isWord, chars in
        groupby(text, lambda c: c.isalnum())  # group runs of word / non-word characters
        if isWord
    ]
If this is homework, they're probably looking for something imperative, like a two-state finite state machine where the state is "was the last character a letter?" and, whenever the state changes from letter to non-letter, you output a word. Don't do that; it's not a good way to program (though sometimes the abstraction is useful).
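For reference, the state machine described above might look roughly like this. It is only a sketch: the name get_words_fsm is hypothetical, and it assumes a "letter" is whatever str.isalpha() accepts.

```python
def get_words_fsm(text):
    words = []
    current = []      # characters of the word being built
    in_word = False   # the state: was the last character a letter?
    for c in text:
        if c.isalpha():
            current.append(c)
            in_word = True
        elif in_word:
            # Transition letter -> non-letter: output the finished word
            words.append(''.join(current))
            current = []
            in_word = False
    if in_word:
        # The text ended in the middle of a word
        words.append(''.join(current))
    return words
```

Compared with the one-liners above, the explicit state and the end-of-input special case are exactly the kind of bookkeeping that makes this style easy to get wrong.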