Capturing emoticons using regular expressions in Python
Question
I would like a regex pattern that matches the smileys ":)" and ":(". It should also capture repeated smileys like ":) :)" and ":) :(", but filter out invalid syntax like ":( (".
I have this, but it matches ":( (":
bool( re.match("(:\()",str) )
I may be missing something obvious here, and I'd like some help with this seemingly simple task.
Recommended answer
I think what you're asking about finally "clicked" for me. Take a look at the code below:
import re

# Matches only strings made up entirely of the smileys ":)" and ":("
smiley_pattern = r'^(:\(|:\))+$'

def test_match(s):
    # Report whether the whole string s is a run of smileys
    print('Value: %s; Result: %s' % (
        s,
        'Matches!' if re.match(smiley_pattern, s) else "Doesn't match."
    ))

should_match = [
    ':)',    # Single smile
    ':(',    # Single frown
    ':):)',  # Two smiles
    ':(:(',  # Two frowns
    ':):(',  # Mix of a smile and a frown
]

should_not_match = [
    '',       # Empty string
    ':(foo',  # Extraneous characters appended
    'foo:(',  # Extraneous characters prepended
    ':( :(',  # Space between frowns
    ':( (',   # Extraneous characters and space appended
    ':((',    # Extraneous duplicate of final character appended
]

print('The following should all match:')
for x in should_match:
    test_match(x)

print('')  # Newline for output clarity

print('The following should all not match:')
for x in should_not_match:
    test_match(x)
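Note that the question's examples include ":) :)" with a space, while the pattern above rejects spaces between smileys. If spaces should be allowed, one possible variant is sketched below. This is an assumption about the intended behaviour, not part of the original answer, and the names spaced_smiley_pattern and is_smiley_run are made up for illustration.

```python
import re

# Hypothetical variant: a smiley, then any number of further smileys,
# each optionally preceded by whitespace.
spaced_smiley_pattern = r'^:[()](\s*:[()])*$'

def is_smiley_run(s):
    """Return True if s is only ':)' / ':(' smileys with optional whitespace between them."""
    return re.match(spaced_smiley_pattern, s) is not None

for s in [':) :)', ':) :(', ':):(', ':( (', ':((', '']:
    print(repr(s), is_smiley_run(s))
```

Here [()] is a character class, so the parentheses inside it are literal and need no backslash.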
The problem with your original code is that your regex is wrong: (:\(). Let's break it down.
The outside parentheses are a "grouping". They're what you'd reference if you were going to do a string replacement, and they're used to apply regex operators to a group of characters at once. So, you're really saying:

    (      begin a group
    :\(    ... do regex stuff ...
    )      end the group
: isn't a regex reserved character, so it's just a colon. \ is reserved, and it means "the following character is literal, not a regex operator". This is called an "escape sequence". Fully parsed into English, your regex says:

    (      begin a group
    :      a colon character
    \(     a left parenthesis character
    )      end the group
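To see why that pattern accepts ":( (", it helps to remember that re.match only requires the pattern to match at the start of the string; it does not require the pattern to consume the whole string. A minimal sketch (the variable name original_pattern is illustrative, not from the question):

```python
import re

original_pattern = r'(:\()'  # the pattern from the question, written as a raw string

for text in [':(', ':( (', ':((', ':)']:
    m = re.match(original_pattern, text)
    # m.group(0) is the matched prefix; whatever follows it is simply ignored
    print(repr(text), '->', repr(m.group(0)) if m else None)
```

Every input that begins with ":(" yields a match object, which bool() turns into True, regardless of what comes after the frown.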
The regex I used is slightly more complex, but not bad. Let's break it down: ^(:\(|:\))+$

^ and $ mean "the beginning of the line" and "the end of the line" respectively. Now we have:

    ^             beginning of line
    (:\(|:\))+    ... do regex stuff ...
    $             end of line
... so it only matches when the smileys make up the entire line, not when they merely occur somewhere in the string.
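The effect of the anchors can be checked directly. The sketch below compares the pattern with and without ^ and $. It also uses re.search and re.fullmatch (Python 3.4+), which do not appear in the answer above and are included only for comparison; the names unanchored and anchored are illustrative.

```python
import re

unanchored = r'(:\(|:\))+'
anchored = r'^(:\(|:\))+$'

# re.match already starts at the beginning, so here only '$' changes the result.
print(bool(re.match(unanchored, ':((')))  # True: ':(' matches, the extra '(' is ignored
print(bool(re.match(anchored, ':((')))    # False: '$' rejects the leftover '('

# re.search scans the whole string, which shows why '^' matters too.
print(bool(re.search(unanchored, 'foo:(')))  # True
print(bool(re.search(anchored, 'foo:(')))    # False

# '$' tolerates a single trailing newline; re.fullmatch does not.
print(bool(re.match(anchored, ':)\n')))        # True
print(bool(re.fullmatch(unanchored, ':)\n')))  # False
```

If a trailing newline must be rejected as well, re.fullmatch with the unanchored pattern is one way to get that behaviour.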
We know that ( and ) denote a grouping. + means "one or more of these". Now we have:

    ^          beginning of line
    (          start a group
    :\(|:\)    ... do regex stuff ...
    )          end the group
    +          one or more of the group
    $          end of line
Finally, there's the | (pipe) operator. It means "or". So, applying what we know from above about escaping characters, we're ready to complete the translation:

    ^     beginning of line
    (     start a group
    :     a colon character
    \(    a left parenthesis character
    |     or
    :     a colon character
    \)    a right parenthesis character
    )     end the group
    +     one or more of the group
    $     end of line
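The same translation can be written into the pattern itself with the re.VERBOSE flag, which ignores whitespace and allows # comments inside the pattern. This is just an alternative way of spelling the regex from the answer; the name smiley_verbose is illustrative.

```python
import re

smiley_verbose = re.compile(r"""
    ^           # beginning of line
    (           # start a group
        : \(    #   a colon, then a literal left parenthesis
      |         #   or
        : \)    #   a colon, then a literal right parenthesis
    )+          # end the group; one or more of the group
    $           # end of line
""", re.VERBOSE)

for s in [':)', ':):(', ':( (', ':((', '']:
    print(repr(s), bool(smiley_verbose.match(s)))
```

Because whitespace is ignored in verbose mode, a literal space in such a pattern would have to be escaped or placed in a character class.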
I hope this helps. If not, let me know and I'll be happy to edit my answer in response.