Access next sibling <li> element with BeautifulSoup


Problem description


    I am completely new to web parsing with Python/BeautifulSoup. I have an HTML document that contains (in part) the following code:

    <div id="pages">
        <ul>
            <li class="active"><a href="example.com">Example</a></li>
            <li><a href="example.com">Example</a></li>
            <li><a href="example1.com">Example 1</a></li>
            <li><a href="example2.com">Example 2</a></li>
        </ul>
    </div>
    

    I have to visit each link (basically each <li> element) until there are no more <li> tags left. Each time a link is clicked, its corresponding <li> element gets the class 'active'. My code is:

    from bs4 import BeautifulSoup
    import urllib2
    import re
    
    landingPage = urllib2.urlopen('somepage.com').read()
    soup = BeautifulSoup(landingPage)
    
    pageList = soup.find("div", {"id": "pages"})
    
    page = pageList.find("li", {"class": "active"})
    

    This code gives me the first <li> item in the list. My logic is to keep checking whether next_sibling is not None. If it is not None, I make an HTTP request to the href attribute of the <a> tag in that sibling <li>. That gets me to the next page, and so on, until there are no more pages.

    But I can't figure out how to get the next_sibling of the page variable given above. Is it page.next_sibling.get("href") or something like that? I looked through the documentation but couldn't find it. Can someone help, please?

    Solution

    Use find_next_sibling() and be explicit about which sibling element you want to find:

    next_li_element = page.find_next_sibling("li")
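
To show why the explicit tag name matters, here is a minimal sketch that parses the HTML from the question as an inline string. It assumes Python 3 and the built-in "html.parser" backend; the variable names raw_sibling and next_href are only illustrative. With this markup, plain .next_sibling returns the whitespace text node between the tags, which has no href to read, while find_next_sibling("li") skips ahead to the next <li> tag.

```python
from bs4 import BeautifulSoup

# The markup from the question, inlined so the example needs no network access
html = """
<div id="pages">
    <ul>
        <li class="active"><a href="example.com">Example</a></li>
        <li><a href="example.com">Example</a></li>
        <li><a href="example1.com">Example 1</a></li>
        <li><a href="example2.com">Example 2</a></li>
    </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
page = soup.find("div", {"id": "pages"}).find("li", {"class": "active"})

# .next_sibling is the very next node: here, the whitespace between two tags
raw_sibling = page.next_sibling

# find_next_sibling("li") ignores text nodes and returns the next <li> tag
next_li_element = page.find_next_sibling("li")

# The link lives on the <a> inside the <li>, not on the <li> itself
next_href = next_li_element.a.get("href")
```

So the href is read from the <a> tag inside the sibling, not from the sibling <li> directly.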
    

    next_li_element will be None if page (the active li) is the last li in the list:

    if next_li_element is None:
        # no more pages to go
        pass
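
Putting the two pieces together, the check against None can drive a loop. The sketch below is one possible way to do it, not the answerer's code: remaining_links is a hypothetical helper name, the HTML is inlined rather than fetched, and the HTTP request for each page is left out. It assumes Python 3 and the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup


def remaining_links(active_li):
    """Collect the href of every <li> that follows active_li (hypothetical helper)."""
    hrefs = []
    next_li_element = active_li.find_next_sibling("li")
    while next_li_element is not None:
        link = next_li_element.find("a")
        # Skip list items that have no link or no href attribute
        if link is not None and link.get("href"):
            hrefs.append(link["href"])
        # Move on; this becomes None after the last <li>, which ends the loop
        next_li_element = next_li_element.find_next_sibling("li")
    return hrefs


html = """
<div id="pages">
    <ul>
        <li class="active"><a href="example.com">Example</a></li>
        <li><a href="example.com">Example</a></li>
        <li><a href="example1.com">Example 1</a></li>
        <li><a href="example2.com">Example 2</a></li>
    </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
page = soup.find("div", {"id": "pages"}).find("li", {"class": "active"})
links = remaining_links(page)
```

In the scenario from the question, each href would be requested in turn and the active <li> looked up again on the newly loaded page. If all following items are wanted at once, page.find_next_siblings("li") returns them as a list.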
    
