Splitting a string into words and punctuation
Problem description
I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.
For instance:
>>> c = "help, me"
>>> print(c.split())
['help,', 'me']
What I really want the list to look like is:
['help', ',', 'me']
So, I want the string split at whitespace with the punctuation split from the words.
I've tried to parse the string first and then run the split:
>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print(separatedPunctuation)
help , me
>>> print(separatedPunctuation.split())
['help', ',', 'me']
This produces the result I want, but is painfully slow on large files.
Is there a way to do this more efficiently?
This is more or less the way to do it:
>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']
The trick is to think not about where to split the string, but about what to include in the tokens.
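Since the question mentions large files, here is a minimal sketch of applying the same pattern line by line with a precompiled regular expression. The names `TOKEN_RE` and `tokenize_lines` are illustrative and not part of the original answer, and it assumes the input can be processed one line at a time.

```python
import re

# Same pattern as in the answer above, compiled once so it can be reused.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

def tokenize_lines(lines):
    """Yield words and punctuation marks from an iterable of lines."""
    for line in lines:
        # findall scans each line once; characters the pattern does not
        # match (whitespace, for example) are skipped.
        for token in TOKEN_RE.findall(line):
            yield token

sample = ["help, me", "Hello, I'm a string!"]
tokens = list(tokenize_lines(sample))
print(tokens)
# ['help', ',', 'me', 'Hello', ',', "I'm", 'a', 'string', '!']
```

An open file object is itself an iterable of lines, so it could be passed to `tokenize_lines` directly instead of a list, which avoids loading the whole file into memory.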
Caveats:
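- The underscore (_) is considered an inner-word character. Replace \w with an explicit character class if you don't want that.
- This will not work with (single) quotes in the string: because ' is part of the word character class, a quoted word such as 'me' comes back with its quotes attached.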
- Put any additional punctuation marks you want to use in the right half of the regular expression.
- Anything not explicitly mentioned in the regular expression is silently dropped.
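The caveats above can be seen on a small input. The sketch below compares the answer's pattern with two variants; the variant patterns are assumptions offered for illustration, not part of the original answer, and may need adjusting for real data.

```python
import re

text = "Say 'hi' to snake_case: done!"

# Pattern from the answer: the colon is not listed, so it is dropped,
# and the quotes around 'hi' stay attached to the word.
original = re.findall(r"[\w']+|[.,!?;]", text)
print(original)
# ['Say', "'hi'", 'to', 'snake_case', 'done', '!']

# Variant (assumption): an apostrophe counts as part of a word only when
# it sits between word characters, and every other non-space character
# becomes its own token, so nothing is silently dropped.
variant = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)
print(variant)
# ['Say', "'", 'hi', "'", 'to', 'snake_case', ':', 'done', '!']

# Variant (assumption): [^\W_] is "a word character other than underscore",
# so the underscore no longer joins two words together.
no_underscore = re.findall(r"[^\W_]+", "snake_case")
print(no_underscore)
# ['snake', 'case']
```

The second pattern still keeps contractions such as "I'm" in one piece, because the apostrophe there is followed by a word character.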