Which library can I use, and how, to scrape text from HTML by its heading and paragraph tags?


Question

My input can be any web document, with no fixed HTML structure. I want to extract the text of each heading (headings may be nested) and of the paragraph tags that follow it (there may be several), and output them as pairs.

A simple HTML example:

<h1>House rule</h1>
<h2>Rule 1</h2>
<p>A</p>
<p>B</p>
<h2>Rule 2</h2>
<h3>Rule 2.1</h3>
<p>C</p>
<h3>Rule 2.2</h3>
<p>D</p>

For this example, I would like the output to be pairs such as:

Rule 2.2, D

Rule 2.1, C

Rule 2, D

Rule 2, C

House rule, D

House rule, C

Rule 1, A B

... and so on.

I am a Python beginner. I know that web scraping is widely done with Scrapy and BeautifulSoup, and that this case might need XPath, or code that identifies sibling tags, because pairing a heading with the paragraphs below it depends on the relative order of the tags. I am not sure which library is the better fit here, and it would be really helpful if you could show me how to do it. Thanks!

Recommended answer

You can do this with BeautifulSoup by walking the heading levels in increasing order (h1, h2, ...) and, for each heading, collecting every <p> tag that follows it:

from bs4 import BeautifulSoup

html = '''
<h1>House rule</h1>
    <h2>Rule 1</h2>
        <p>A</p>
        <p>B</p>
    <h2>Rule 2</h2>
        <h3>Rule 2.1</h3>
            <p>C</p>
        <h3>Rule 2.2</h3>
            <p>D</p>'''

soup = BeautifulSoup(html, "lxml")

# Walk the heading levels h1, h2, ... until a level has no tags.
counter = 1
all_leafs = []
while True:
    htag = 'h%d' % counter
    hgroups = soup.find_all(htag)
    print(htag, len(hgroups))
    counter += 1
    if not hgroups:
        break
    for hgroup in hgroups:
        # Pair the heading with every <p> that follows it in document order.
        for p in hgroup.find_all_next('p'):
            all_leafs.append((hgroup.get_text(), p.get_text()))
print(all_leafs)

...

h1 1
h2 2
h3 2
h4 0
[('House rule', 'A'), ('House rule', 'B'), ('House rule', 'C'), ('House rule', 'D'), ('Rule 1', 'A'), ('Rule 1', 'B'), ('Rule 1', 'C'), ('Rule 1', 'D'), ('Rule 2', 'C'), ('Rule 2', 'D'), ('Rule 2.1', 'C'), ('Rule 2.1', 'D'), ('Rule 2.2', 'D')]
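Note that the output above pairs each heading with every later <p>, including paragraphs that sit under a later sibling heading, such as ('Rule 1', 'C') and ('Rule 2.1', 'D'). The question asks for pairs limited to each heading's own section. The following is a minimal sketch of one way to get that, not part of the original answer: it makes a single pass in document order and keeps a stack of the headings that are currently "open". The function name heading_paragraph_pairs and the variable names are illustrative, and it assumes the built-in "html.parser" backend rather than "lxml".

```python
from bs4 import BeautifulSoup

HEADINGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def heading_paragraph_pairs(html):
    """Pair each <p> with every heading whose section contains it.

    Illustrative helper (hypothetical name), assuming a section ends at
    the next heading of the same or a higher level.
    """
    soup = BeautifulSoup(html, "html.parser")
    open_headings = []  # stack of (level, text) for sections still open
    pairs = []
    # find_all with a list of names returns matching tags in document order.
    for tag in soup.find_all(HEADINGS + ["p"]):
        if tag.name == "p":
            text = tag.get_text(strip=True)
            pairs.extend((title, text) for _, title in open_headings)
        else:
            level = int(tag.name[1])
            # A new heading closes open sections at the same or a deeper level.
            while open_headings and open_headings[-1][0] >= level:
                open_headings.pop()
            open_headings.append((level, tag.get_text(strip=True)))
    return pairs


sample_html = """
<h1>House rule</h1>
<h2>Rule 1</h2>
<p>A</p>
<p>B</p>
<h2>Rule 2</h2>
<h3>Rule 2.1</h3>
<p>C</p>
<h3>Rule 2.2</h3>
<p>D</p>
"""

pairs = heading_paragraph_pairs(sample_html)
for heading, paragraph in pairs:
    print(heading, paragraph, sep=", ")
```

With the sample document, 'Rule 1' is paired only with A and B, 'Rule 2.1' only with C, and 'House rule' with all four paragraphs. Because the stack compares heading levels instead of counting up from h1, this sketch also copes with documents that skip a level (for example an h1 followed directly by an h3), whereas the loop above stops at the first level that has no tags.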
