Split a string into a list, leaving accented chars and emoticons but removing punctuation
Question
If I have the string:
"O João foi almoçar :) ."
How do I best split it into a list of words in Python, like so:
['O','João', 'foi', 'almoçar', ':)']
?
Thanks :)
Sofia
Recommended answer
If the punctuation falls into its own space-separated token, as in your example, then it's easy:
>>> import string
>>> [w for w in "O João foi almoçar :) .".split() if w not in string.punctuation]
['O', 'João', 'foi', 'almoçar', ':)']
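One caveat with the one-liner: the `not in string.punctuation` check is a substring test against the punctuation string, so a multi-character token such as "..." is kept. Below is a minimal sketch of a stricter filter, assuming a hand-maintained set of smileys. The names `SMILEYS` and `split_keep_smileys` are illustrative and not part of the original answer.

```python
import string

# Hypothetical starter set of smileys; extend it as needed.
SMILEYS = {":)", ":("}


def split_keep_smileys(text):
    """Split on whitespace and drop tokens made up only of ASCII punctuation."""
    words = []
    for token in text.split():
        if token in SMILEYS:
            # Known smiley: keep it even though it is all punctuation.
            words.append(token)
        elif all(ch in string.punctuation for ch in token):
            # Pure punctuation such as "." or "...": skip it.
            continue
        else:
            words.append(token)
    return words


print(split_keep_smileys("O João foi almoçar :) ."))
# ['O', 'João', 'foi', 'almoçar', ':)']
```

This still relies on punctuation being separated from words by whitespace, as in the example string.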
If this is not the case, you can define a dictionary of smileys like this (you'll need to add more):
d = { ':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}
and then replace each instance of a smiley with a placeholder that doesn't contain punctuation (we'll consider <> not to be punctuation):
for smiley, placeholder in d.items():
    s = s.replace(smiley, placeholder)
Which gets us to "O João foi almoçar <HAPPY_SMILEY> .".
We then strip punctuation:
s = ''.join(c for c in s if c not in '.,!')
Which gives us "O João foi almoçar <HAPPY_SMILEY> " (the trailing space is discarded by the later split).
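The stripping step above only removes the three characters '.', ',' and '!'. As a hedged sketch of a broader version, the snippet below deletes every ASCII punctuation mark except the ones used by the `<..._SMILEY>` placeholders. The names `KEEP`, `TABLE` and `strip_punctuation` are assumptions made for illustration.

```python
import string

# Characters that appear in placeholders such as <HAPPY_SMILEY> and must survive.
KEEP = "<>_"

# Translation table that deletes all other ASCII punctuation characters.
TABLE = str.maketrans("", "", "".join(ch for ch in string.punctuation if ch not in KEEP))


def strip_punctuation(text):
    """Remove ASCII punctuation, leaving placeholder characters untouched."""
    return text.translate(TABLE)


print(repr(strip_punctuation("O João foi almoçar <HAPPY_SMILEY> .")))
# 'O João foi almoçar <HAPPY_SMILEY> '
```

Note that this also removes punctuation inside words (apostrophes, hyphens), which may or may not be what you want.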
We then restore the smileys:
for smiley, placeholder in d.items():
    s = s.replace(placeholder, smiley)
And then we split:
s = s.split()
Giving us our final result: ['O', 'João', 'foi', 'almoçar', ':)'].
Putting it all together into a function:
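def split_special(s):
    d = {':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}
    # Protect smileys by swapping them for punctuation-free placeholders.
    for smiley, placeholder in d.items():
        s = s.replace(smiley, placeholder)
    # Strip the punctuation characters we want to drop.
    s = ''.join(c for c in s if c not in '.,!')
    # Put the smileys back, then split on whitespace.
    for smiley, placeholder in d.items():
        s = s.replace(placeholder, smiley)
    return s.split()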
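As a possible variation (not the answer's own code), the sketch below takes the smileys and the punctuation set as parameters. It assumes the input contains no Unicode private-use characters, since those are used as placeholders so that the full `string.punctuation` set can be stripped. It also pads each smiley with spaces so that a smiley glued to a word becomes its own token. The name `split_special_general` and the default smiley list are hypothetical.

```python
import string


def split_special_general(text, smileys=(":)", ":(", ":-)", ";)"),
                          punctuation=string.punctuation):
    """Split text into words, keeping smileys and dropping punctuation."""
    # Replace longer smileys first so ":-)" is never partially consumed.
    ordered = sorted(smileys, key=len, reverse=True)
    # Assumption: private-use code points do not occur in the input text.
    placeholders = {s: chr(0xE000 + i) for i, s in enumerate(ordered)}
    for smiley in ordered:
        # Pad with spaces so the smiley ends up as a separate token.
        text = text.replace(smiley, " " + placeholders[smiley] + " ")
    # Strip every character in the punctuation set.
    text = "".join(ch for ch in text if ch not in punctuation)
    # Restore the smileys and split on whitespace.
    for smiley, placeholder in placeholders.items():
        text = text.replace(placeholder, smiley)
    return text.split()


print(split_special_general("O João foi almoçar:), e voltou!"))
# ['O', 'João', 'foi', 'almoçar', ':)', 'e', 'voltou']
```

Sorting by length is a design choice to avoid one smiley matching inside another; whether it matters depends on which smileys you add.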