Using BeautifulSoup to find all "ul" and "li" elements

Question

I'm currently working on a crawling script in Python, and I want to map the following HTML response into a nested list or a dictionary (either is fine).

My current code is:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req     = Request("https://my.site.com/crawl", headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req)
soup    = BeautifulSoup(webpage, 'html.parser')
ul      = soup.find('ul', {'class': ''})

After running this, I get the following result stored in ul:

<ul>
    <li><a class="reference" href="#ref1">Data1</a></li>
    <li><a class="reference" href="#ref2">Data2</a>
        <ul>
            <li><a class="reference" href="#ref3">Data3</a></li>
            <li><a class="reference" href="#ref4">Data4</a>
                <ul>
                    <li><a class="reference" href="#ref5"><span class="pre">Data5</span></a></li>
                    <li><a class="reference" href="#ref6"><span class="pre">Data6</span></a></li>
                    .
                    .
                    .
                </ul>
            </li>
        </ul>
    </li>
    <li><a class="reference" href="#ref7">Data7</a>
        <ul>
            <li><a class="reference" href="#ref8"><span class="pre">Data8</span></a></li>
            <li><a class="reference" href="#ref9"><span class="pre">Data9</span></a></li>
            .
            .
            .
        </ul>
    </li>
    <li><a class="reference" href="#ref10">Data10</a>
        <ul>
            <li><a class="reference" href="#ref11"><span class="pre">Data11</span></a></li>
            <li><a class="reference" href="#ref12">Data12</a></li>
        </ul>
    </li>
</ul>

As this is an external site, I cannot control the id or class of the elements in the list.

I can't get my head around this. Is there a simple way to arrange the data into a list or dict, along these lines?

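desired = {'Data1': {'href': 'ref1'},
           'Data2': {'href': 'ref2',
                     'children': {  # key name is illustrative; nested items need some key
                         'Data3': {'href': 'ref3'},
                         'Data4': {'href': 'ref4',
                                   'children': {
                                       'Data5': {'href': 'ref5'},
                                       'Data6': {'href': 'ref6'},
                                       # ... further items elided
                                   }},
                     }},
           }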

This feels like a cumbersome process, but I don't see any other way of doing it.

Any help to get me going in the right direction is much appreciated!

Cheers!

Recommended answer

Just recurse over the ul element: for each direct li element that contains a link, pull out the link text and attributes, and recurse deeper if the li also contains a nested <ul> element:

def parse_ul(elem):
    result = {}
    for sub in elem.find_all('li', recursive=False):
        if sub.a is None:
            continue
        data = {k: v for k, v in sub.a.attrs.items() if k != 'class'}
        if sub.ul is not None:
            # recurse down
            data['children'] = parse_ul(sub.ul)
        result[sub.a.get_text(strip=True)] = data
    return result

This takes all direct li children of the given element. If an li contains an <a> element, the text of that anchor becomes a key, and a copy of the tag's attributes (ignoring any class attribute) is stored as the value. If the li also contains a <ul> element next to the <a> tag, that list is parsed recursively and added under a children key in the attribute dictionary for the <a> tag.
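To see why recursive=False matters here, the following minimal sketch compares the default search with the direct-children-only search. The HTML snippet and the variable names are made up for illustration and are not part of the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-level list, much smaller than the real page.
html = """
<ul>
  <li><a href="#a">A</a></li>
  <li><a href="#b">B</a>
    <ul>
      <li><a href="#c">C</a></li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
top = soup.ul  # the outermost <ul>

# Default search: every <li> descendant, including those in nested lists.
all_items = [li.a.get_text(strip=True) for li in top.find_all("li")]

# recursive=False: only <li> elements that are direct children of `top`.
direct_items = [li.a.get_text(strip=True)
                for li in top.find_all("li", recursive=False)]

print(all_items)     # ['A', 'B', 'C']
print(direct_items)  # ['A', 'B']
```

Without recursive=False, the nested item C would be picked up at the top level as well as inside B's sub-list, so the hierarchy would be lost.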

For your sample input, this produces:

>>> from pprint import pprint    
>>> pprint(parse_ul(soup.ul))
{'Data1': {'href': '#ref1'},
 'Data10': {'children': {'Data11': {'href': '#ref11'},
                         'Data12': {'href': '#ref12'}},
            'href': '#ref10'},
 'Data2': {'children': {'Data3': {'href': '#ref3'},
                        'Data4': {'children': {'Data5': {'href': '#ref5'},
                                               'Data6': {'href': '#ref6'}},
                                  'href': '#ref4'}},
           'href': '#ref2'},
 'Data7': {'children': {'Data8': {'href': '#ref8'}, 'Data9': {'href': '#ref9'}},
           'href': '#ref7'}}
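
Once the data is in this shape, it can be walked with a second small recursive function. The sketch below is one possible way to do that, not part of the original answer; iter_links and sample are hypothetical names, and sample is a shortened hand-written copy of the structure shown above.

```python
def iter_links(tree, path=()):
    """Yield (path, href) pairs from a dict shaped like parse_ul's result."""
    for text, data in tree.items():
        current = path + (text,)
        yield current, data.get('href')
        # Descend into nested entries, if this item has any.
        yield from iter_links(data.get('children', {}), current)


# Shortened, hand-written sample in the same shape as the output above.
sample = {
    'Data1': {'href': '#ref1'},
    'Data2': {'href': '#ref2',
              'children': {'Data3': {'href': '#ref3'}}},
}

for path, href in iter_links(sample):
    print(' > '.join(path), href)
# Data1 #ref1
# Data2 #ref2
# Data2 > Data3 #ref3
```

Note that pprint sorts dictionary keys, which is why 'Data10' appears before 'Data2' in the output above. The dictionary itself keeps insertion order on Python 3.7 and later, so iterating over it follows the order of the page.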
