Beautiful Soup parsing multiple <div> and successive <p> tags into dictionary

Question

I have multiple inline divs (which are 'headers') with paragraph tags beneath them (not inside the divs) that are conceptually their 'children'. I would like to convert this to a dictionary, but I can't quite figure out the best way to do it. Here is roughly what the site looks like:

<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>

The Python code I have so far looks like this:

soup = bs.BeautifulSoup(source,'lxml')

full_discussion = soup.find(attrs={'class' : 'field field-type-text field-field-discussion'})

ava_discussion = full_discussion.find(attrs = {'class': 'field-item odd'})

for div in ava_discussion.find_all("div"):
    discussion = []

    if div.findNextSibling('p'):
        discussion.append(div.findNextSibling('p').get_text())

    location = div.get_text()

    ava_dict.update({location: {"discussion": discussion}})

However, the problem is that this code only adds the first <p> tag and then moves on to the next div. Ultimately, I'd like to add every <p> to the discussion list. Help!

Update:

Adding a while loop gives me duplicates of the first <p> tag, one for each <p> that exists. Here is the code:

for div in ava_discussion.find_all("div"):
    ns = div.nextSibling

    discussion = []

    while ns is not None and ns.name != "div":
        if ns.name == "p":
            discussion.append(div.findNextSibling('p').get_text())
        ns = ns.nextSibling

    location = div.get_text()

    ava_dict.update({location : {"discussion": discussion}})

print(json.dumps(ava_dict, indent=2))

Recommended answer

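I wasn't adding the correct text: inside the loop, div.findNextSibling('p') always returns the first <p> after the div, regardless of where ns currently points. Appending ns.get_text() instead works: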

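for div in ava_discussion.find_all("div"):
    # Start at the node immediately after this header div.
    ns = div.nextSibling

    discussion = []

    # Walk forward through the siblings until the next header div (or the end).
    while ns is not None and ns.name != "div":
        if ns.name == "p":
            # Append the text of the current sibling, not of the first <p> after div.
            discussion.append(ns.get_text())
        ns = ns.nextSibling

    location = div.get_text()

    ava_dict.update({location : {"discussion": discussion}})

print(json.dumps(ava_dict, indent=2))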
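
For anyone who wants to try this outside the original site, here is a minimal self-contained sketch of the same sibling-walking idea. It is not the asker's exact code: the wrapper div, the SAMPLE_HTML string, and the group_paragraphs helper are assumptions made for illustration, and it uses the built-in "html.parser" and the next_siblings generator instead of lxml and a manual while loop.

```python
from bs4 import BeautifulSoup

# Stand-in for the real page: the wrapper div is an assumption,
# the inner markup is the sample from the question.
SAMPLE_HTML = """
<div class="field-item odd">
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
</div>
"""


def group_paragraphs(container):
    """Map each direct-child header <div> to the <p> texts that follow it."""
    result = {}
    for div in container.find_all("div", recursive=False):
        discussion = []
        for sibling in div.next_siblings:
            # Text nodes between tags have no tag name; treat them as None.
            name = getattr(sibling, "name", None)
            if name == "div":
                break  # reached the next header, stop collecting
            if name == "p":
                discussion.append(sibling.get_text(strip=True))
        result[div.get_text(strip=True)] = {"discussion": discussion}
    return result


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("div", class_="field-item")
ava_dict = group_paragraphs(container)
```

Passing recursive=False keeps find_all from treating nested divs as headers, which may or may not match the real page's structure.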
