Python string operation, extract text between html tags


Problem description

I have a string:

<font face="ARIAL,HELVETICA" size="-2">  
JUL 28         </font>

(It outputs over two lines, so there must be a \n in there.)

I wish to extract the string that's in between the <font></font> tags. In this case, it's JUL 28, but it might be another date or some other number.

1) What is the best way to extract the value from between the font tags? I was thinking I could extract everything in between "> and </.

Edit: second question removed.

Solution

While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, a Python library that handles broken as well as well-formed HTML fairly well.

>>> from BeautifulSoup import BeautifulSoup as BSHTML
>>> BS = BSHTML("""
... <font face="ARIAL,HELVETICA" size="-2">  
... JUL 28         </font>"""
... )
>>> BS.font.contents[0].strip()
u'JUL 28'
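
The session above uses the import path of the older BeautifulSoup 3 package on Python 2. If installing a third-party library is not an option, roughly the same extraction can be sketched with the standard library's html.parser module. The class name FontTextExtractor and the helper extract_font_text are hypothetical names chosen for this illustration, not part of the original answer, and the sketch only collects text inside <font> elements:

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Hypothetical helper: collects text found inside <font> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <font> tags are currently open
        self.chunks = []     # text fragments seen while inside <font>

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Only keep text that appears between <font> and </font>.
        if self._depth > 0:
            self.chunks.append(data)


def extract_font_text(html):
    """Return the stripped text between <font> tags, or '' if there is none."""
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


sample = '<font face="ARIAL,HELVETICA" size="-2">  \nJUL 28         </font>'
print(extract_font_text(sample))  # JUL 28
```

This keeps the tag handling in a real parser instead of matching on "> and </, so attribute values that happen to contain those characters do not confuse the extraction.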

Then you just need to parse the date:

>>> from datetime import datetime
>>> datetime.strptime(BS.font.contents[0].strip(), '%b %d')
datetime.datetime(1900, 7, 28, 0, 0)
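
Because the format '%b %d' carries no year, strptime falls back to 1900. The question also notes that the extracted text might be some other number rather than a date. A minimal sketch that handles both points, assuming the caller knows which year the text refers to (the function name parse_month_day and its year parameter are hypothetical, and %b matching depends on an English locale being active):

```python
from datetime import datetime


def parse_month_day(text, year):
    """Hypothetical helper: parse text such as 'JUL 28' into a datetime.

    The year is supplied by the caller because the text does not contain one.
    Returns None when the text is not a month/day pair.
    """
    try:
        # Append the year so strptime receives a complete date.
        return datetime.strptime(f"{text.strip()} {year}", "%b %d %Y")
    except ValueError:
        return None


print(parse_month_day("  \nJUL 28         ", 2024))
print(parse_month_day("12345", 2024))
```

Returning None for non-dates lets the caller decide how to treat the "some other number" case without catching exceptions at every call site.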
