解析与BeautifulSoup嵌套的HTML列表 [英] Parsing nested HTML list with BeautifulSoup

查看:845
本文介绍了解析与BeautifulSoup嵌套的HTML列表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要解析嵌套的HTML列表,并将其转换为父子字典。鉴于此列表:

I need to parse a nested HTML list and convert it to a parent-child dict. Given this list:

<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul>
          <li>Debian</li>
          <li>Fedora</li>
          <li>Ubuntu</li>
        </ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul>
      <li>Python</li>
      <li>C#</li>
      <li>Ruby</li>
    </ul>
  </li>
</ul>

我想将其转换为这样的字典:

I want to convert it to a dict like this:

{
    'Operating System': {
        'Linux': {
            'Debian': None,
            'Fedora': None,
            'Ubuntu': None,
        },
        'Windows': None,
        'OS X': None,
    },
    'Programming Languages': {
        'Python': None,
        'C#': None,
        'Ruby': None,
    }
}

我最初的尝试是使用 find_all(礼,递归= FALSE)。它返回顶层项目(操作系统和编程语言),而且孩子们。

My initial attempt is using find_all('li', recursive=False). It returns the top level items (Operating System and Programming Languages) but also the children.

我如何与BeautifulSoup办呢?

How can I do it with BeautifulSoup?

推荐答案

下面是一种方法:

def dictify(ul):
    result = {}
    for li in ul.find_all("li", recursive=False):
        key = next(li.stripped_strings)
        ul = li.find("ul")
        if ul:
            result[key] = dictify(ul)
        else:
            result[key] = None
    return result

使用示例:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <ul>
...   <li>Operating System
...     <ul>
...       <li>Linux
...         <ul>
...           <li>Debian</li>
...           <li>Fedora</li>
...           <li>Ubuntu</li>
...         </ul>
...       </li>
...       <li>Windows</li>
...       <li>OS X</li>
...     </ul>
...   </li>
...   <li>Programming Languages
...     <ul>
...       <li>Python</li>
...       <li>C#</li>
...       <li>Ruby</li>
...     </ul>
...   </li>
... </ul>
... """)
>>> ul = soup.body.ul
>>> from pprint import pprint
>>> pprint(dictify(ul), width=1)
{u'Operating System': {u'Linux': {u'Debian': None,
                                  u'Fedora': None,
                                  u'Ubuntu': None},
                       u'OS X': None,
                       u'Windows': None},
 u'Programming Languages': {u'C#': None,
                            u'Python': None,
                            u'Ruby': None}}

这篇关于解析与BeautifulSoup嵌套的HTML列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆