Extracting words from a string, removing punctuation and returning a list with separated words

Question

I was wondering how to implement a function get_words() that returns the words of a string as a list, with the punctuation stripped away.

The way I would like to implement it is to replace every character that is not in string.ascii_letters with '' and then return the result of .split().

def get_words(text):
    '''The function should take one argument which is a string'''
    return text.split()

For example:

>>> get_words('Hello world, my name is...James!')

returns:

['Hello', 'world', 'my', 'name', 'is', 'James']

Recommended answer

This has nothing to do with splitting or punctuation; you only care about the letters (and numbers), so all you need is a regular expression:

import re

def get_words(text):
    # \w+ matches runs of letters, digits and underscores
    return re.compile(r'\w+').findall(text)

Demo:

>>> re.compile(r'\w+').findall('Hello world, my name is...James the 2nd!')
['Hello', 'world', 'my', 'name', 'is', 'James', 'the', '2nd']

If you don't care about numbers, replace \w with [A-Za-z] to match letters only, or with [A-Za-z'] to keep contractions together, and so on. There are probably fancier ways to build an alphabetic, non-numeric character class (e.g. one that includes letters with accents) using other regex features.
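As a rough sketch of the variants just mentioned, assuming Python 3 (where \w is Unicode-aware for str patterns): the helper name find_words, the sample sentence, and the [^\W\d_] idiom for "any letter, accented ones included" are illustrative additions, not part of the original answer.

```python
import re

# Hypothetical helper for illustration: return every match of `pattern`.
def find_words(text, pattern):
    return re.findall(pattern, text)

# \u00c9 is the precomposed capital E with acute accent.
sample = "Don't stop, \u00c9lise: it's the 2nd try!"

# ASCII letters only: digits, apostrophes and accented letters
# all act as separators.
ascii_only = find_words(sample, r"[A-Za-z]+")

# ASCII letters plus the apostrophe keeps contractions together.
with_contractions = find_words(sample, r"[A-Za-z']+")

# [^\W\d_] reads as "a word character that is neither a digit nor an
# underscore", which in Python 3 covers Unicode letters.
unicode_letters = find_words(sample, r"[^\W\d_]+")

print(ascii_only)
print(with_contractions)
print(unicode_letters)
```

Note that the ASCII-only classes cut accented names apart, so which class is right depends on the text you expect.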

I almost answered this question here: Split Strings with Multiple Delimiters?

But your question is actually under-specified: do you want 'this is: an example' to be split into:

  • ['this', 'is', 'an', 'example']
  • or ['this', 'is', '', 'an', 'example']?

I assumed it was the first case.
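The difference between the two cases can be sketched with the standard re module. The patterns below are assumptions chosen for illustration; the point is only that splitting on single delimiter characters produces empty strings where delimiters are adjacent, while matching the words directly does not.

```python
import re

text = 'this is: an example'

# Splitting on each single non-letter leaves an empty string wherever
# two delimiters are adjacent (here ':' followed by ' ').
split_each = re.split(r'[^A-Za-z]', text)

# Splitting on runs of non-letters removes the empty strings in the middle...
split_runs = re.split(r'[^A-Za-z]+', text)

# ...but still leaves one at the end if the text ends with punctuation.
trailing = re.split(r'[^A-Za-z]+', 'Hi there!')

# Matching the words themselves never yields empty strings.
found = re.findall(r'[A-Za-z]+', 'Hi there!')

print(split_each)
print(split_runs)
print(trailing)
print(found)
```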

['this', 'is', 'an', 'example'] is what I want. Is there a method that doesn't import regex? If we just replace the non-ascii_letters with '' and then split the string into a list of words, would that work? – James Smith 2 mins ago

The regexp is the most elegant, but yes, you could do this as follows:

def get_words(text):
    """
        Returns a list of words, where a word is defined as a
        maximal run of alphanumeric characters, as defined by
        str.isalnum()

        >>> get_words('Hello world, my name is... Élise!')  # works in python3
        ['Hello', 'world', 'my', 'name', 'is', 'Élise']
    """
    # Turn every non-alphanumeric character into a space, then split on whitespace
    return ''.join((c if c.isalnum() else ' ') for c in text).split()

Use .isalpha() in place of .isalnum() if digits should not count as part of a word.
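For the variant the comment asks about (string.ascii_letters, no regex), a minimal sketch might look like the following. The name get_words_ascii is hypothetical, and the one deliberate change from the comment's wording is that non-letters are replaced with a space rather than '', because replacing with '' would glue 'is...James' into 'isJames'.

```python
import string

# Set lookup is faster than scanning the ascii_letters string each time.
LETTERS = set(string.ascii_letters)

def get_words_ascii(text):
    # Replace every non-ASCII-letter with a space (not ''), so adjacent
    # words separated only by punctuation stay separate.
    cleaned = ''.join(c if c in LETTERS else ' ' for c in text)
    # split() with no arguments collapses runs of whitespace.
    return cleaned.split()

print(get_words_ascii('Hello world, my name is...James!'))
```

Unlike the isalnum()/isalpha() version, this treats accented letters as separators, since they are not in string.ascii_letters.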

Sidenote: you could also do the following, though it requires importing another standard library module:

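from itertools import groupby

# groupby is generally always overkill and makes for unreadable code
# ... but is fun

def get_words(text):
    # groupby yields (key, run) pairs for consecutive characters that share
    # the same isalnum() value; keep only the alphanumeric runs
    return [
        ''.join(chars)
        for is_word, chars in groupby(text, lambda c: c.isalnum())
        if is_word
    ]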


If this is homework, they're probably looking for something imperative, like a two-state finite state machine where the state is "was the last character a letter?" and a word is emitted whenever the state changes from letter to non-letter. Don't do that; it's not a good way to program (though sometimes the abstraction is useful).
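For reference only, the two-state machine described above might be sketched like this. The function name get_words_fsm and the use of isalpha() as the "letter" test are assumptions made for illustration; the answer's advice to prefer the shorter versions still stands.

```python
def get_words_fsm(text):
    words = []
    current = []      # characters of the word being built
    in_word = False   # the state: was the last character a letter?

    for c in text:
        if c.isalpha():
            current.append(c)
            in_word = True
        elif in_word:
            # Transition letter -> non-letter: emit the finished word.
            words.append(''.join(current))
            current = []
            in_word = False

    # The text may end while still inside a word.
    if in_word:
        words.append(''.join(current))
    return words

print(get_words_fsm('Hello world, my name is...James!'))
```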
