Python Regex - find string between html tags
Question
I am trying to extract the string between HTML tags. I can see that similar questions have been asked on Stack Overflow before, but I am completely new to Python and I am struggling.
If I have
<b>Bold Stuff</b>
I want a regular expression that leaves me with
Bold Stuff
But all of my solutions so far have left me with things like
>Bold Stuff<
I would really appreciate any help with this.
I have
>.*?<
And I have seen a question on Stack Overflow with the suggested solution
>([^<>]*)<
But neither of these is working for me. Could someone please explain how to write a regex that says "find me the string between characters x and y, not including x and y"?
Thanks for your help.
Recommended answer
>>> a = '<b>Bold Stuff</b>'
>>>
>>> import re
>>> re.findall(r'>(.+?)<', a)
['Bold Stuff']
>>> re.findall(r'>(.*?)<', a)[0] # non-greedy mode
'Bold Stuff'
>>> re.findall(r'>(.+?)<', a)[0] # or this, also is non-greedy mode
'Bold Stuff'
>>> re.findall(r'>(.*)<', a)[0] # greedy mode
'Bold Stuff'
>>>
For this input, both greedy and non-greedy matching give the same result.
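The question asks for a general way to say "the string between x and y, not including x and y". A minimal sketch of that idea follows; the helper name `between` is hypothetical and not part of the original answer. It assumes the delimiters are plain strings and uses `re.escape` so that special characters in them are matched literally.

```python
import re

def between(text, start, end):
    """Return every substring found between `start` and `end`, delimiters excluded."""
    # re.escape makes the delimiters literal, so characters such as ( or . are safe.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    # With exactly one capturing group, findall returns only the group contents.
    return re.findall(pattern, text)

print(between('<b>Bold Stuff</b>', '>', '<'))   # ['Bold Stuff']
print(between('f(x) + g(y)', '(', ')'))         # ['x', 'y']
```

The delimiters stay in the pattern so the match is anchored to them, but only the parenthesized part is returned.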
Your first pattern, >.*?<, is already non-greedy; it only lacks the capturing group (...), which is why the match still includes > and <. Here is an example of the difference between non-greedy and greedy matching:
>>> a = '<b>Bold <br> Stuff</b>'
>>> re.findall(r'>(.*?)<', a)[0]
'Bold '
>>> re.findall(r'>(.*)<', a)[0]
'Bold <br> Stuff'
>>>
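The transcript above only looks at the first match (`[0]`). As a small sketch, the same input can be run through all three patterns mentioned in this thread, including the `>([^<>]*)<` pattern from the question, to see every match that `re.findall` returns. The variable names here are illustrative only.

```python
import re

a = '<b>Bold <br> Stuff</b>'

# Non-greedy: stops at the first < after each >.
lazy = re.findall(r'>(.*?)<', a)
# Greedy: runs from the first > to the last <.
greedy = re.findall(r'>(.*)<', a)
# Negated character class: can never cross a < or >, greedy or not.
negated = re.findall(r'>([^<>]*)<', a)

print(lazy)     # ['Bold ', ' Stuff']
print(greedy)   # ['Bold <br> Stuff']
print(negated)  # ['Bold ', ' Stuff']
```

On this input the non-greedy pattern and the negated character class agree; the greedy pattern swallows the inner `<br>` tag.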
Here is what the Python re documentation says about (...):
(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals ( or ), use \( or \), or enclose them inside a character class: [(] [)].
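To make the quoted description concrete, here is a short sketch that retrieves group contents after a match, uses the `\1` sequence to require that the closing tag repeats the opening tag's name, and matches literal parentheses by escaping them. The sample strings and the name `tag_pair` are assumptions for illustration, and the pattern only handles simple tags without attributes.

```python
import re

a = '<b>Bold Stuff</b> and <i>Italic Stuff</i>'

# Group 1 captures the tag name; \1 requires the closing tag to repeat it.
# Group 2 captures the text between the opening and closing tags.
tag_pair = re.compile(r'<(\w+)>(.*?)</\1>')

m = tag_pair.search(a)
print(m.group(1))  # b
print(m.group(2))  # Bold Stuff

# With two groups, findall returns a list of tuples, one per match.
pairs = tag_pair.findall(a)
print(pairs)  # [('b', 'Bold Stuff'), ('i', 'Italic Stuff')]

# Escaped parentheses match literal ( and ) instead of opening a group.
literal = re.findall(r'\((.*?)\)', 'call(x) then call(y)')
print(literal)  # ['x', 'y']
```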