Converting HTML list to nested Python list

Question

If I have a nested html (unordered) list that looks like this:

<ul>
    <li><a href="Page1_Level1.html">Page1_Level1</a> 
    <ul>
        <li><a href="Page1_Level2.html">Page1_Level2</a> 
            <ul>
                <li><a href="Page1_Level3.html">Page1_Level3</a></li>
            </ul>
            <ul>
                <li><a href="Page2_Level3.html">Page2_Level3</a></li>
            </ul>
            <ul>
                <li><a href="Page3_Level3.html">Page3_Level3</a></li>
            </ul>
        </li>
    </ul>
    </li>
    <li><a href="Page2_Level1.html">Page2_Level1</a> 
    <ul>
        <li><a href="Page2_Level2.html">Page2_Level2</a></li>
    </ul>
    </li>
</ul>

How do I form a nested list out of it in Python? For example:

["Page1_Level1.html", ["Page1_Level2.html", ["Page1_Level3.html", "Page2_Level3.html", "Page3_Level3.html"]], "Page2_Level1.html", ["Page2_Level2.html"]]

I presume libraries like Beautiful Soup and HTML Parser have facilities to do this, but I haven't been able to figure it out. Thanks for any help / pointers!

Solution

You can take a recursive approach:

from pprint import pprint
from bs4 import BeautifulSoup

text = """your html goes here"""

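def find_li(element):
    # Map each direct <li>'s link target to the items nested one level below it.
    # recursive=False restricts each lookup to direct children only.
    return [{li.a['href']: find_li(li)}
            for ul in element('ul', recursive=False)
            for li in ul('li', recursive=False)]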


soup = BeautifulSoup(text, 'html.parser')
data = find_li(soup)
pprint(data)

Prints (under Python 2, hence the u'' prefixes):

[{u'Page1_Level1.html': [{u'Page1_Level2.html': [{u'Page1_Level3.html': []},
                                                 {u'Page2_Level3.html': []},
                                                 {u'Page3_Level3.html': []}]}]},
 {u'Page2_Level1.html': [{u'Page2_Level2.html': []}]}]
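
The structure above is a list of single-key dicts, not the flat nested list the question asks for. As a minimal sketch, the result can be reshaped afterwards; the helper name to_nested_list is hypothetical (it is not part of Beautiful Soup), and the data literal simply repeats the printed output:

```python
def to_nested_list(nodes):
    """Turn [{href: children}, ...] into [href, [child hrefs, ...], ...]."""
    result = []
    for node in nodes:
        for href, children in node.items():
            result.append(href)
            if children:  # add a sub-list only when there are nested items
                result.append(to_nested_list(children))
    return result


# Same structure as the output printed by find_li above.
data = [
    {'Page1_Level1.html': [
        {'Page1_Level2.html': [
            {'Page1_Level3.html': []},
            {'Page2_Level3.html': []},
            {'Page3_Level3.html': []},
        ]},
    ]},
    {'Page2_Level1.html': [{'Page2_Level2.html': []}]},
]

nested = to_nested_list(data)
print(nested)
```

Items without children contribute only their href, so the three sibling ul blocks at level 3 collapse into one sub-list, as in the question's example.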

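FYI, I had to use html.parser here: other parsers such as lxml and html5lib wrap the fragment in html and body tags, so the top-level ul would no longer be a direct child of soup and find_li(soup) would return an empty list.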
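
Since the question also mentions HTML Parser, here is a rough sketch of the same idea using only the standard library's html.parser module, producing the list shape from the question directly. The class name NestedListParser and its attributes are made up for illustration, and it assumes well-formed markup in which every li starts with a link:

```python
from html.parser import HTMLParser


class NestedListParser(HTMLParser):
    """Collect <a href> values into lists nested like the <ul> elements."""

    def __init__(self):
        super().__init__()
        self.root = []            # top-level result
        self.stack = [self.root]  # stack[-1] is the list for the current depth
        self.depth = 0            # number of currently open <ul> elements

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
            if self.depth > 1:
                current = self.stack[-1]
                # Sibling <ul>s under the same <li> share one sub-list.
                if current and isinstance(current[-1], list):
                    sub = current[-1]
                else:
                    sub = []
                    current.append(sub)
                self.stack.append(sub)
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.stack[-1].append(href)

    def handle_endtag(self, tag):
        if tag == "ul":
            if self.depth > 1:
                self.stack.pop()
            self.depth -= 1


html = """
<ul>
    <li><a href="Page1_Level1.html">Page1_Level1</a>
    <ul>
        <li><a href="Page1_Level2.html">Page1_Level2</a>
            <ul><li><a href="Page1_Level3.html">Page1_Level3</a></li></ul>
            <ul><li><a href="Page2_Level3.html">Page2_Level3</a></li></ul>
            <ul><li><a href="Page3_Level3.html">Page3_Level3</a></li></ul>
        </li>
    </ul>
    </li>
    <li><a href="Page2_Level1.html">Page2_Level1</a>
    <ul>
        <li><a href="Page2_Level2.html">Page2_Level2</a></li>
    </ul>
    </li>
</ul>
"""

parser = NestedListParser()
parser.feed(html)
result = parser.root
print(result)
```

The stack mirrors the open ul elements, so each link lands in the list for its nesting level. Malformed HTML or list items without a link would need extra handling that this sketch does not attempt.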
