Python string operation, extract text between html tags
I have a string:
<font face="ARIAL,HELVETICA" size="-2">
JUL 28 </font>
(It outputs over two lines, so there must be a \n in there.)
I wish to extract the string that's in between the <font></font> tags. In this case, it's JUL 28, but it might be another date or some other number.
1) What is the best way to extract the value from between the font tags? I was thinking I could extract everything in between ">" and "</".
Edit: second question removed.
While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, a Python library that handles broken as well as well-formed HTML fairly well.
>>> from bs4 import BeautifulSoup as BSHTML
>>> BS = BSHTML("""
... <font face="ARIAL,HELVETICA" size="-2">
... JUL 28 </font>""",
... "html.parser")
>>> BS.font.contents[0].strip()
'JUL 28'
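If installing a third-party library is not an option, the standard library's html.parser module can probably cover a case this simple. The following is only a minimal sketch, not part of the original answer: FontTextExtractor and its texts attribute are hypothetical names chosen for illustration, and it assumes you want the stripped text of each outermost <font> element.

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Collect the text found inside each <font>...</font> element."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # how many <font> tags are currently open
        self._chunks = []  # text pieces of the element being read
        self.texts = []    # one stripped string per outermost <font> element

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "FONT" also arrives as "font".
        if tag == "font":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears while a <font> element is open.
        if self._depth > 0:
            self._chunks.append(data)


html = """
<font face="ARIAL,HELVETICA" size="-2">
JUL 28 </font>"""

parser = FontTextExtractor()
parser.feed(html)
parser.close()
print(parser.texts)  # ['JUL 28']
```

This is less forgiving of badly broken markup than BeautifulSoup, so it is best kept for input whose shape you control.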
Then you just need to parse the date:
>>> from datetime import datetime
>>> datetime.strptime(BS.font.contents[0].strip(), '%b %d')
datetime.datetime(1900, 7, 28, 0, 0)
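Because the format string has no year, strptime falls back to 1900, and the question notes that the value might be some other number rather than a date. One way to handle both is sketched below. parse_value is a hypothetical helper, not something from the original answer, and it assumes the caller knows which year the date belongs to and that month abbreviations are in English.

```python
from datetime import datetime


def parse_value(text, year):
    """Return a datetime for 'JUL 28'-style text, an int for digits, else the text."""
    text = text.strip()
    try:
        # Appending the year avoids the 1900 default and lets Feb 29 parse in leap years.
        return datetime.strptime(f"{text} {year}", "%b %d %Y")
    except ValueError:
        pass
    try:
        # Not a date: fall back to a plain integer.
        return int(text)
    except ValueError:
        # Neither a date nor an integer: hand back the stripped text unchanged.
        return text


print(parse_value("\nJUL 28 ", 2024))
print(parse_value("42", 2024))
```

Trying the date format first and the integer second keeps a bare number such as 42 from being mistaken for a date, since it cannot match the month-day-year pattern.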