Parsing a nested HTML list with BeautifulSoup
Question
I need to parse a nested HTML list and convert it to a parent-child dict. Given this list:
<ul>
    <li>Operating System
        <ul>
            <li>Linux
                <ul>
                    <li>Debian</li>
                    <li>Fedora</li>
                    <li>Ubuntu</li>
                </ul>
            </li>
            <li>Windows</li>
            <li>OS X</li>
        </ul>
    </li>
    <li>Programming Languages
        <ul>
            <li>Python</li>
            <li>C#</li>
            <li>Ruby</li>
        </ul>
    </li>
</ul>
I want to convert it to a dict like this:
{
    'Operating System': {
        'Linux': {
            'Debian': None,
            'Fedora': None,
            'Ubuntu': None,
        },
        'Windows': None,
        'OS X': None,
    },
    'Programming Languages': {
        'Python': None,
        'C#': None,
        'Ruby': None,
    }
}
My initial attempt used find_all('li', recursive=False). It returns the top-level items (Operating System and Programming Languages), but each of them still carries its children.
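The behaviour described here can be reproduced with a short sketch. It assumes a shortened copy of the list and the built-in "html.parser" backend; the names html, top_ul, top_items, own_label and full_text are illustrative and not part of the original question. The point is that recursive=False does limit the match to the two top-level items, but each returned tag still holds its whole subtree, so reading its text pulls in the nested labels.

```python
from bs4 import BeautifulSoup

# A shortened version of the list from the question (assumed sample data).
html = """
<ul>
  <li>Operating System
    <ul>
      <li>Linux</li>
      <li>Windows</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul>
      <li>Python</li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
top_ul = soup.find("ul")

# recursive=False restricts the search to direct children of top_ul.
top_items = top_ul.find_all("li", recursive=False)
print(len(top_items))

for li in top_items:
    # get_text() walks the whole subtree, so nested labels are included.
    full_text = " ".join(li.get_text().split())
    # The first stripped string is only the item's own label.
    own_label = next(li.stripped_strings)
    print(repr(own_label), "->", repr(full_text))
```

So the match itself is already limited to the top level; what remains is to separate each item's own label from the nested list it contains.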
How can I do this with BeautifulSoup?
Recommended answer
Here is one way to do it:
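def dictify(ul):
    """Convert a <ul> tag into a nested dict keyed by each item's own label."""
    result = {}
    # recursive=False limits the search to the direct <li> children of this <ul>.
    for li in ul.find_all("li", recursive=False):
        # The first stripped string is the item's own label, not a descendant's.
        key = next(li.stripped_strings)
        # Use a separate name so the `ul` argument is not shadowed.
        child_ul = li.find("ul")
        if child_ul is not None:
            result[key] = dictify(child_ul)
        else:
            result[key] = None
    return result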
Usage example:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <ul>
... <li>Operating System
... <ul>
... <li>Linux
... <ul>
... <li>Debian</li>
... <li>Fedora</li>
... <li>Ubuntu</li>
... </ul>
... </li>
... <li>Windows</li>
... <li>OS X</li>
... </ul>
... </li>
... <li>Programming Languages
... <ul>
... <li>Python</li>
... <li>C#</li>
... <li>Ruby</li>
... </ul>
... </li>
... </ul>
... """, "html.parser")
>>> ul = soup.ul
>>> from pprint import pprint
>>> pprint(dictify(ul), width=1)
{u'Operating System': {u'Linux': {u'Debian': None,
                                  u'Fedora': None,
                                  u'Ubuntu': None},
                       u'OS X': None,
                       u'Windows': None},
 u'Programming Languages': {u'C#': None,
                            u'Python': None,
                            u'Ruby': None}}
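The u'...' prefixes in the output above come from a Python 2 session. For readers on Python 3, the same approach might be written as a standalone script along the following lines. This is a sketch under stated assumptions, not the answer author's exact code: the "html.parser" backend, the constant HTML, and the names tree and expected are choices made here for illustration. It also uses soup.find("ul") rather than soup.body.ul, since html.parser does not add a <body> element to a fragment.

```python
from bs4 import BeautifulSoup

# The list from the question (HTML is an illustrative constant name).
HTML = """
<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul>
          <li>Debian</li>
          <li>Fedora</li>
          <li>Ubuntu</li>
        </ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul>
      <li>Python</li>
      <li>C#</li>
      <li>Ruby</li>
    </ul>
  </li>
</ul>
"""


def dictify(ul):
    """Map each direct <li> label to a nested dict, or None for a leaf."""
    result = {}
    for li in ul.find_all("li", recursive=False):
        key = next(li.stripped_strings)  # the item's own label
        child_ul = li.find("ul")         # first nested list, if any
        result[key] = dictify(child_ul) if child_ul is not None else None
    return result


# html.parser adds no <html>/<body> wrapper, so look up the <ul> directly.
soup = BeautifulSoup(HTML, "html.parser")
tree = dictify(soup.find("ul"))

# The structure requested in the question.
expected = {
    "Operating System": {
        "Linux": {"Debian": None, "Fedora": None, "Ubuntu": None},
        "Windows": None,
        "OS X": None,
    },
    "Programming Languages": {"Python": None, "C#": None, "Ruby": None},
}
print(tree == expected)
```

Comparing against the expected dict checks the structure without depending on how pprint orders or wraps the keys.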