Python regex findall
Question
I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Or simply, I want to extract every piece of text inside the [P][/P] tags. Here is my attempt:
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
Printing person yields ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]'] or ['Barack Obama', 'Bill Gates']?
Recommended answer
import re
regex = ur"\[P\] (.+?) \[/P\]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
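The ur prefix above is only accepted by Python 2. As a minimal sketch, assuming Python 3, a plain raw string does the same job. The trailing +? is dropped here because it only applies to the final ] and makes no difference for this input. The second pattern, tolerant, is a hypothetical variant (not part of the original answer) that allows any amount of whitespace around the name.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# Same idea as the answer's pattern, written as a Python 3 raw string.
pattern = re.compile(r"\[P\] (.+?) \[/P\]")
people = pattern.findall(line)
print(people)  # ['Barack Obama', 'Bill Gates']

# Hypothetical variant: \s* absorbs optional whitespace inside the tags.
tolerant = re.compile(r"\[P\]\s*(.+?)\s*\[/P\]")
sample = "[P]Ada Lovelace[/P] and [P]  Alan Turing [/P]"
tolerant_people = tolerant.findall(sample)
print(tolerant_people)  # ['Ada Lovelace', 'Alan Turing']
```

Because the pattern contains exactly one capturing group, findall returns the group's text rather than the whole match.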
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same unicode as u'[[1P].+?[/P]]+?' except harder to read.
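The equivalence can be checked directly. This is a small sketch assuming Python 3, where an ordinary (non-raw) string literal decodes \u escapes the same way Python 2's ur prefix does. The names escaped and plain are only for illustration.

```python
# \u005B is '[', \u005D is ']' and \u002F is '/'.
escaped = "[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
plain = "[[1P].+?[/P]]+?"

same = (escaped == plain)
print(same)     # True
print(escaped)  # [[1P].+?[/P]]+?
```

Note that a Python 3 raw string (r"...") would not decode the escapes at the literal level, so the comparison above deliberately uses a normal string.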
The first bracketed group [[1P] tells re that any of the characters in the list ['[', '1', 'P'] should match, and similarly the second bracketed group [/P] matches either '/' or 'P' (the ] that follows it is then just a literal bracket). That's not what you want at all. So,
- Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
- To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
- To return only the words inside the tags, place grouping parentheses around .+?.
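The steps above can be sketched one at a time, assuming Python 3. The variable names broken, tagged and inner are illustrative. In broken, the leading [ inside the character class is written as \[, which should match the same set of characters as the original [[1P] while avoiding the nested-set warning that recent Python 3 versions raise for a class starting with a literal [.

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# The original attempt: two character classes instead of literal tags.
broken = r"[\[1P].+?[/P]]+?"
broken_result = re.findall(broken, line)
print(broken_result)
# ['President [P]', '[/P]', '[P] Bill Gates [/P]']

# Outer brackets and stray 1 removed, literal brackets escaped.
tagged = r"\[P\].+?\[/P\]"
tagged_result = re.findall(tagged, line)
print(tagged_result)
# ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']

# Grouping parentheses added so only the text inside the tags is returned.
inner = r"\[P\] (.+?) \[/P\]"
inner_result = re.findall(inner, line)
print(inner_result)
# ['Barack Obama', 'Bill Gates']
```

The second and third patterns correspond to the two outputs asked for in the question: the full tagged spans, and only the names.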