Python Regex - find string between HTML tags

Question

I am trying to extract the string between HTML tags. I can see that similar questions have been asked on Stack Overflow before, but I am completely new to Python and I am struggling.

If I have

<b>Bold Stuff</b>

I want a regular expression that leaves me with

Bold Stuff

But all of my solutions so far have left me with things like

>Bold Stuff<

I would really appreciate any help with this.

I have

>.*?<

And I have seen a question on Stack Overflow with the suggested solution

>([^<>]*)<

But neither of these is working for me. Could someone please explain how to write a regex that says "find me the string between characters x and y, not including x and y"?

Thanks for your help.

Recommended answer

>>> import re
>>> a = '<b>Bold Stuff</b>'
>>> re.findall(r'>(.+?)<', a)  # with one group, findall returns only the group's text
['Bold Stuff']
>>> re.findall(r'>(.*?)<', a)[0]  # non-greedy, zero or more characters
'Bold Stuff'
>>> re.findall(r'>(.+?)<', a)[0]  # also non-greedy, but needs at least one character
'Bold Stuff'
>>> re.findall(r'>(.*)<', a)[0]  # greedy
'Bold Stuff'

For this input, both greedy mode and non-greedy mode work, because there is no other tag between the first > and the last <.

You're using the first one, non-greedy mode. Here is an example that shows the difference between non-greedy mode and greedy mode:

>>> a = '<b>Bold <br> Stuff</b>'
>>> re.findall(r'>(.*?)<', a)[0]  # non-greedy: stops at the nearest '<'
'Bold '
>>> re.findall(r'>(.*)<', a)[0]  # greedy: runs on to the last '<'
'Bold <br> Stuff'
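
To see every match rather than only the first, the same input can be run without the [0] index. The following is a minimal sketch, not part of the original answer; the variable names are chosen for illustration, and the third pattern is the negated character class quoted in the question.

```python
import re

html = '<b>Bold <br> Stuff</b>'

# Lazy quantifier: each match ends at the nearest following '<'.
lazy = re.findall(r'>(.*?)<', html)

# Greedy quantifier: a single match from the first '>' to the last '<'.
greedy = re.findall(r'>(.*)<', html)

# Negated character class from the question: the group can never
# contain '<' or '>', so it cannot run across a nested tag.
negated = re.findall(r'>([^<>]*)<', html)

print(lazy)     # ['Bold ', ' Stuff']
print(greedy)   # ['Bold <br> Stuff']
print(negated)  # ['Bold ', ' Stuff']
```

On this input the lazy pattern and the negated class give the same pieces, while the greedy pattern swallows the nested <br> tag.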

Here is what the re module documentation says about (...):

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below.

To match the literals ( or ), use \( or \), or enclose them inside a character class: [(] [)].
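
The group is what removes the > and < from the result: the pattern >.*?< in the question matches the right region but has no group, so the delimiters come back as part of the match. The sketch below illustrates this under a few assumptions: the helper name between is hypothetical, and the lookaround and backreference variants are alternatives that the original answer does not mention.

```python
import re

html = '<b>Bold Stuff</b>'

# No group: findall returns the whole match, delimiters included.
without_group = re.findall(r'>.*?<', html)

# One group: findall returns only what the parentheses captured.
with_group = re.findall(r'>(.*?)<', html)

# With search, group(0) is the whole match and group(1) the captured part.
m = re.search(r'>(.*?)<', html)
whole, inner = m.group(0), m.group(1)

# Alternative: lookbehind/lookahead assert the delimiters without consuming them.
lookaround = re.findall(r'(?<=>).*?(?=<)', html)

# \1 refers back to group 1, so the closing tag must repeat the opening name.
paired = re.findall(r'<(\w+)>(.*?)</\1>', html)


def between(text, x, y):
    """Hypothetical helper: text between x and y, excluding x and y.

    Note that '.' does not match a newline unless re.DOTALL is passed.
    """
    pattern = re.escape(x) + r'(.*?)' + re.escape(y)
    return re.findall(pattern, text)


print(without_group)            # ['>Bold Stuff<']
print(with_group)               # ['Bold Stuff']
print(whole, '|', inner)        # >Bold Stuff< | Bold Stuff
print(lookaround)               # ['Bold Stuff']
print(paired)                   # [('b', 'Bold Stuff')]
print(between(html, '>', '<'))  # ['Bold Stuff']
```

re.escape is used in the helper so that delimiters such as ( or . are treated as literal characters rather than regex syntax.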
