Splitting a string into words and punctuation

Problem description

I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

For instance:

>>> c = "help, me"
>>> print(c.split())
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So, I want the string split at whitespace with the punctuation split from the words.

I've tried to parse the string first and then run the split:

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print(separatedPunctuation)
help , me
>>> print(separatedPunctuation.split())
['help', ',', 'me']

This produces the result I want, but is painfully slow on large files.

Is there a way to do this more efficiently?

Solution

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is to think not about where to split the string, but about what to include in the tokens.
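
Since the question mentions large files, here is a minimal sketch of how the same pattern could be wrapped for reuse. The names TOKEN_RE, tokenize and tokenize_lines are made up for illustration and do not come from the original answer. The pattern is compiled once, and the generator walks an iterable of lines (such as an open file object) so the whole input never needs to be held in memory.

```python
import re

# Same pattern as in the answer, compiled once so it can be reused.
# TOKEN_RE, tokenize and tokenize_lines are hypothetical names.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

def tokenize(text):
    """Return the words and punctuation marks found in text, in order."""
    return TOKEN_RE.findall(text)

def tokenize_lines(lines):
    """Lazily yield tokens from an iterable of lines (e.g. an open file)."""
    for line in lines:
        for match in TOKEN_RE.finditer(line):
            yield match.group()

print(tokenize("help, me"))
print(list(tokenize_lines(["Hello, I'm\n", "a string!\n"])))
```

Whether this is faster than the character loop in the question depends on the data, so it is worth timing on a real file before relying on it.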

Caveats:

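  • The underscore (_) is matched by \w, so it counts as a word character. Replace \w with an explicit character class if you don't want that.
  • Single quotes used as quotation marks are not handled: the apostrophe is treated as part of a word, so the quotes stay attached to the neighbouring word.
  • Put any additional punctuation marks you want to keep in the right half of the regular expression.
  • Anything not explicitly mentioned in the regular expression is silently dropped.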
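
The caveats above can be checked with a few short calls. This is only an illustrative sketch: the VARIANT pattern is an assumption showing one possible adjustment, not part of the original answer, and its [A-Za-z0-9] class covers ASCII letters and digits only.

```python
import re

ORIGINAL = r"[\w']+|[.,!?;]"

# The underscore is matched by \w, so it stays inside the word.
underscore_tokens = re.findall(ORIGINAL, "snake_case works")
print(underscore_tokens)

# Single quotes used as quotation marks stick to the neighbouring word.
quote_tokens = re.findall(ORIGINAL, "say 'hi' now")
print(quote_tokens)

# Characters that appear in neither alternative are silently dropped.
dropped_tokens = re.findall(ORIGINAL, "a - b: (c)")
print(dropped_tokens)

# Hypothetical variant: no underscore in words (ASCII only),
# and a few more punctuation marks in the right half.
VARIANT = r"[A-Za-z0-9']+|[.,!?;:()\-]"
variant_tokens = re.findall(VARIANT, "a_b - c: (d)")
print(variant_tokens)
```

With the variant, the underscore falls into the "silently dropped" category, so a_b becomes two separate tokens.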
