Python scrape value between static HTML tags containing static text
Problem description
This is my first post on this forum, and I hope someone here can answer my basic question.
My requirement consists of two steps.
- In the first step, I need to extract the value "Paid Death Notice" from the HTML below, based on the span tags with classes c8 and c2. The "DOCUMENT-TYPE:" text is static and will always be present in my HTML.
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
Similarly, for the HTML below, I need to extract the value "Newspaper" based on "PUBLICATION-TYPE:", again using the spans with classes c8 and c2.
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
The solution I have tried:
from bs4 import BeautifulSoup
import re
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>"""
soup = BeautifulSoup(data,'lxml')
doc=soup.find('span',class_='c8')
doctext=re.compile('<SPAN(.*DOCUMENT-TYPE: </SPAN><SPAN.*?)</SPAN>')
print(doctext.match(doc.text))
Result:
None
whereas I expect to get only Paid Death Notice as the result.
- In the second step, there could be many HTML tags with the same DOCUMENT-TYPE: field that differ only in their value. In that case, how should I loop over them, and on what condition?
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
Please help me resolve this issue.
Note: I have searched the web and tried many approaches but could not find the right solution, so I am posting here in the hope of getting one.
Recommended answer
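import re

# Sample input: two DOCUMENT-TYPE entries and one PUBLICATION-TYPE entry
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
"""
# A raw string avoids invalid escape sequences; (.*?) captures the text of the c2 span
pattern = r'<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">(.*?)</SPAN>'
# findall returns one captured value per match; strip("*") drops the ** markers
print([a.strip("*") for a in re.findall(pattern, data)])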
Output:
['Paid Death Notice', 'Paid Notice: Deaths THORNTON, ROBERT']
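The same idea can be generalized so that one function handles both DOCUMENT-TYPE: and PUBLICATION-TYPE:, which also covers the looping part of the question. The following is only a sketch: the helper name `extract_values` is made up for illustration, and it assumes the markup always has a c8 span holding the label immediately followed by a c2 span holding the value, as in the samples above. Regular expressions are fragile on less regular HTML, so treat this as a starting point rather than a general parser.

```python
import re


def extract_values(html, label):
    """Return the text of every c2 span that directly follows a c8 span holding `label`."""
    pattern = re.compile(
        # re.escape keeps characters such as "-" in the label literal
        r'<SPAN\s+CLASS="c8">\s*' + re.escape(label) + r'\s*</SPAN>'
        r'\s*<SPAN\s+CLASS="c2">(.*?)</SPAN>',
        re.IGNORECASE | re.DOTALL,
    )
    # Strip the ** markers and any surrounding whitespace from each captured value
    return [value.strip("* \n") for value in pattern.findall(html)]


data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
"""

# Loop over the labels of interest; each call returns all matching values in order
for label in ("DOCUMENT-TYPE:", "PUBLICATION-TYPE:"):
    print(label, extract_values(data, label))
```

Because `findall` already returns every match, no explicit condition is needed to iterate over repeated DOCUMENT-TYPE: fields; the returned list has one entry per occurrence.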