Converting HTML list to nested Python list
Question
If I have a nested HTML (unordered) list that looks like this:
<ul>
  <li><a href="Page1_Level1.html">Page1_Level1</a>
    <ul>
      <li><a href="Page1_Level2.html">Page1_Level2</a>
        <ul>
          <li><a href="Page1_Level3.html">Page1_Level3</a></li>
        </ul>
        <ul>
          <li><a href="Page2_Level3.html">Page2_Level3</a></li>
        </ul>
        <ul>
          <li><a href="Page3_Level3.html">Page3_Level3</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="Page2_Level1.html">Page2_Level1</a>
    <ul>
      <li><a href="Page2_Level2.html">Page2_Level2</a></li>
    </ul>
  </li>
</ul>
How do I form a nested list out of it in Python? For example:
["Page1_Level1.html", ["Page1_Level2.html", ["Page1_Level3.html", "Page2_Level3.html", "Page3_Level3.html"]], "Page2_Level1.html", ["Page2_Level2.html"]]
I presume libraries like Beautiful Soup and HTML Parser have facilities to do this, but I haven't been able to figure it out. Thanks for any help or pointers!
Recommended answer
You can take a recursive approach:
from pprint import pprint

from bs4 import BeautifulSoup

text = """your html goes here"""

def find_li(element):
    # recursive=False restricts each search to direct children, so every
    # <li> is collected at its own nesting level only.
    return [{li.a['href']: find_li(li)}
            for ul in element('ul', recursive=False)
            for li in ul('li', recursive=False)]

soup = BeautifulSoup(text, 'html.parser')
data = find_li(soup)
pprint(data)
Prints:
[{u'Page1_Level1.html': [{u'Page1_Level2.html': [{u'Page1_Level3.html': []},
                                                 {u'Page2_Level3.html': []},
                                                 {u'Page3_Level3.html': []}]}]},
 {u'Page2_Level1.html': [{u'Page2_Level2.html': []}]}]
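The output above is a list of dictionaries, while the question asks for plain nested lists. A minimal sketch of one way to get that shape with the same recursive idea follows. The function name to_nested and the shortened sample markup are assumptions made for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, same structure as in the question.
text = """
<ul>
  <li><a href="Page1_Level1.html">Page1_Level1</a>
    <ul>
      <li><a href="Page1_Level2.html">Page1_Level2</a>
        <ul><li><a href="Page1_Level3.html">Page1_Level3</a></li></ul>
        <ul><li><a href="Page2_Level3.html">Page2_Level3</a></li></ul>
        <ul><li><a href="Page3_Level3.html">Page3_Level3</a></li></ul>
      </li>
    </ul>
  </li>
  <li><a href="Page2_Level1.html">Page2_Level1</a>
    <ul><li><a href="Page2_Level2.html">Page2_Level2</a></li></ul>
  </li>
</ul>
"""


def to_nested(element):
    """Hypothetical helper: return hrefs as a flat list, with each item's
    children appended as a sub-list right after it."""
    result = []
    for ul in element('ul', recursive=False):
        for li in ul('li', recursive=False):
            result.append(li.a['href'])
            children = to_nested(li)
            if children:  # skip empty sub-lists for leaf items
                result.append(children)
    return result


soup = BeautifulSoup(text, 'html.parser')
result = to_nested(soup)
print(result)
```

Sibling ul elements under the same li are merged into a single sub-list, which matches the shape requested in the question.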
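FYI, html.parser is needed here: unlike lxml and html5lib, it does not wrap the fragment in html and body tags, so the top-level ul stays a direct child of the soup object and the recursive=False search can find it.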
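The question also mentions HTML Parser. As a rough sketch, the same nested list can be built with only the standard library's html.parser.HTMLParser, without Beautiful Soup. The class name NestedListBuilder and its attributes are hypothetical, and the sketch assumes every li has an explicit closing tag, as in the question's markup.

```python
from html.parser import HTMLParser


class NestedListBuilder(HTMLParser):
    """Hypothetical helper: collects the first href of each <li> into nested lists."""

    def __init__(self):
        super().__init__()
        self.root = []        # final nested list
        self.open_items = []  # stack of [href, children] for each open <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.open_items.append([None, []])
        elif tag == "a" and self.open_items and self.open_items[-1][0] is None:
            # The first link seen inside the innermost open <li> belongs to it.
            self.open_items[-1][0] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "li" and self.open_items:
            href, children = self.open_items.pop()
            # Attach to the enclosing <li>, or to the root at the top level.
            parent = self.open_items[-1][1] if self.open_items else self.root
            if href is not None:
                parent.append(href)
            if children:
                parent.append(children)


html_text = (
    '<ul>'
    '<li><a href="Page1_Level1.html">Page1_Level1</a>'
    '<ul><li><a href="Page1_Level2.html">Page1_Level2</a>'
    '<ul><li><a href="Page1_Level3.html">Page1_Level3</a></li></ul>'
    '<ul><li><a href="Page2_Level3.html">Page2_Level3</a></li></ul>'
    '</li></ul>'
    '</li>'
    '<li><a href="Page2_Level1.html">Page2_Level1</a></li>'
    '</ul>'
)

builder = NestedListBuilder()
builder.feed(html_text)
builder.close()
nested = builder.root
print(nested)
```

Because ul tags are ignored and only li boundaries drive the stack, sibling ul elements under one li end up in the same sub-list.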