Python: scrape the value between static HTML tags containing static text


Question


This is my first post on this forum, and I hope it can answer my basic question.

My requirement consists of two steps.

  1. In the first step, I need to extract the value "Paid Death Notice" from the HTML below, based on the span tags with classes c8 and c2. The "DOCUMENT-TYPE:" text is static and will always be present in my HTML.

<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>

Similarly, for the HTML below, I need to extract the value "Newspaper" based on "PUBLICATION-TYPE:", again using the span tags with classes c8 and c2.

<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>

Solution I have tried:

from bs4 import BeautifulSoup
import re

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
           <SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>"""


soup = BeautifulSoup(data, 'lxml')
doc = soup.find('span', class_='c8')
doctext = re.compile('<SPAN(.*DOCUMENT-TYPE: </SPAN><SPAN.*?)</SPAN>')
print(doctext.match(doc.text))

Result:

None

I expected to get only "Paid Death Notice" as the result.

  2. Similarly, there could be many HTML tags with the same "DOCUMENT-TYPE:" field that differ only in their value. In that case, how should I loop over them, and on what condition?

<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>

Please help me resolve this issue.

Note: I have searched the web and tried many approaches but could not find the right solution, so I am posting here in the hope of getting one.

Solution

import re

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
           <SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
           <SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
           """
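# Capture the text of the c2 span that directly follows the "DOCUMENT-TYPE: " label.
# A raw string avoids invalid escape sequences; (.*?) stops at the first closing tag.
pattern = r'<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">(.*?)</SPAN>'
print([a.strip("*") for a in re.findall(pattern, data)])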

Output:

['Paid Death Notice', 'Paid Notice: Deaths THORNTON, ROBERT']
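
The regular expression above depends on the exact spelling, case, and spacing of the tags. Since the question started from BeautifulSoup, here is a minimal parser-based sketch of the same idea. It is not part of the original answer. The helper name values_for_label is hypothetical, and the built-in "html.parser" backend is assumed instead of lxml so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>"""


def values_for_label(html, label):
    """Hypothetical helper: collect the text of each c2 span that follows
    a c8 span whose stripped text equals `label`."""
    # The parser lowercases tag and attribute names, so "span"/"class" match SPAN/CLASS.
    soup = BeautifulSoup(html, "html.parser")
    values = []
    for label_span in soup.find_all("span", class_="c8"):
        # Loop condition: keep only the label spans with the wanted static text.
        if label_span.get_text(strip=True) != label:
            continue
        # Take the next sibling span with class c2 (assumes every label has its own value span).
        value_span = label_span.find_next_sibling("span", class_="c2")
        if value_span is not None:
            # Strip the literal asterisks that appear around one value in the sample data.
            values.append(value_span.get_text(strip=True).strip("*"))
    return values


doc_types = values_for_label(data, "DOCUMENT-TYPE:")
pub_types = values_for_label(data, "PUBLICATION-TYPE:")
print(doc_types)
print(pub_types)
```

This also addresses the second step of the question: the loop visits every c8 span, and the condition is simply whether its text equals the static label, so the same function works for "PUBLICATION-TYPE:" as well.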
