Python scrape value between static HTML tags containing static text

Views: 160 Published: 2018/6/26 10:21:45 python html web-scraping


Problem description

This is my first post on this forum, and I hope the forum can answer my basic question.

My requirement here consists of two steps.

In the first step, I need to extract the value "Paid Death Notice" from the HTML below, based on the span tags with classes c8 and c2. The "DOCUMENT-TYPE:" text is static and will always be present in my HTML.

<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>

Similarly, for the HTML below, I need to extract the value "Newspaper" based on "PUBLICATION-TYPE", again using the span tags with classes c8 and c2:

<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>

Solution I have tried:

    from bs4 import BeautifulSoup
    import re

    data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN></P>
    <SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>"""

    soup = BeautifulSoup(data, 'lxml')
    doc = soup.find('span', class_='c8')
    doctext = re.compile('<SPAN(.*DOCUMENT-TYPE: <SPAN.*?)')
    print(doctext.match(doc.text))

Result:
None
I should get only "Paid Death Notice" as the result.

Similarly, there could be many HTML tags with the same "DOCUMENT-TYPE:" field that differ only in their value. In that case, how should I loop over them, and on what condition?

<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>

Please help me resolve the issue.

Note: I have searched the web and tried many approaches but could not find the right solution, so I am posting here in the hope of getting one.

Solution

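    import re

    data = """
    <SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN></P>
    <SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
    <SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
    """

    # Capture whatever follows the static DOCUMENT-TYPE label span
    pattern = r'<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">(.*)</SPAN>'

    # strip("*") removes the ** markers around the first value
    print([a.strip("*") for a in re.findall(pattern, data)])
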
Output:
['Paid Death Notice', 'Paid Notice: Deaths THORNTON, ROBERT']
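
The same idea can be generalised to handle every label at once, which also covers the "how do I loop" part of the question. The following is only a minimal sketch, assuming each label span (class c8) is immediately followed by its value span (class c2) as in the snippets above; the names `PAIR` and `collect_fields` are illustrative and not part of the original answer.

```python
import re
from collections import defaultdict

# Sample markup modelled on the snippets in the question (assumed layout:
# a c8 label span immediately followed by a c2 value span).
data = """
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
"""

# Group 1 is the label text before the colon, group 2 is the value.
# The value group is non-greedy so it stops at the first closing span tag.
PAIR = re.compile(
    r'<span\s+class="c8">\s*([^<:]+):\s*</span>'
    r'\s*<span\s+class="c2">(.*?)</span>',
    re.IGNORECASE | re.DOTALL,
)

def collect_fields(html):
    """Return {label: [value, ...]} for every label/value span pair found."""
    fields = defaultdict(list)
    for label, value in PAIR.findall(html):
        fields[label.strip()].append(value.strip())
    return dict(fields)

fields = collect_fields(data)
print(fields["DOCUMENT-TYPE"])
print(fields["PUBLICATION-TYPE"])
```

Grouping by label means no per-field condition is needed in the loop: every "DOCUMENT-TYPE" value ends up in one list regardless of how many times the field occurs. For markup less regular than this, an HTML parser such as BeautifulSoup is likely to be a more robust choice than a regular expression.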
