Beautiful Soup parsing inline <div> and <p> into dictionary

Question

I'm working on parsing a pretty nasty site. Basically, there are inline divs (which act as 'headers') and paragraph tags beneath them (not IN the divs) that are theoretically their 'children'. I would like to convert this to a dictionary, but I can't quite figure out the best way to do it. Here is roughly what the site looks like:

<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>

So, theoretically (and incorrectly), the Python code would go something like this...

import bs4 as bs   

dict = {"Key" : "Value"}


soup = bs.BeautifulSoup(source,'lxml')
for item in soup:
    if item.tag == "div":
        dict['key'] = item.text
        if item.tag == "p":
            dict['value'] = item.text

But then somehow, once the next <div> is found, it needs to break, and start a new key value. I'm having such a hard time wrapping my head around this... Help!

UPDATE: The suggested solution worked beautifully.

Recommended answer

You can first find all the divs, then loop through them and, for each div, take the text of its next sibling p tag. Add more attribute constraints to the find_all call to make sure it only matches the divs you actually want:

{div.get_text(): div.find_next_sibling('p').get_text() for div in soup.find_all("div")}

#{'This should be dict key1': 'This should be the value of key1',
# 'This should be dict key2': 'This should be the value of key2'}
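
As a self-contained illustration of the two points above (constraining find_all and pairing each div with its next p), here is a minimal sketch. The class names "header" and "nav" and the extra divs are hypothetical, added only to show the filter and the missing-paragraph case; the real site's attributes are not known. It uses the built-in "html.parser" so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question. The class attributes and the
# first and last <div> are hypothetical, added only for illustration.
html = """
<div class="nav">Unrelated navigation block</div>
<div class="header"><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<div class="header"><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
<div class="header"><span>Key without a paragraph</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = {}
# The class_ constraint skips the unrelated "nav" div.
for div in soup.find_all("div", class_="header"):
    p = div.find_next_sibling("p")
    # find_next_sibling returns None when no later <p> sibling exists,
    # so guard before calling get_text().
    pairs[div.get_text(strip=True)] = p.get_text(strip=True) if p is not None else None

print(pairs)
```

Without the class constraint, the "nav" div would also become a key, paired with the first p that follows it. Note as well that find_next_sibling('p') looks past intervening divs, so a header with no paragraph of its own would pick up the paragraph of a later header if one exists.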


Update: if several p tags follow a div, loop through all the divs and, for each one, collect every p up to the next div, appending each as a value under that div's key. A defaultdict simplifies the logic a little:

from collections import defaultdict
result = defaultdict(list)

for div in soup.find_all("div"):
    # Walk forward through the siblings until the next div (or the end)
    ns = div.next_sibling
    while ns is not None and ns.name != "div":
        if ns.name == "p":
            result[div.text].append(ns.text)
        ns = ns.next_sibling

result
# defaultdict(list,
#             {'This should be dict key1': ['This should be the value of key1',
#              'This should also be the value of key1'],
#              'This should be dict key2': ['This should be the value of key2']})


HTML used:

<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should also be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
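
For convenience, the grouping approach and the HTML above can be combined into one runnable script. This is only a sketch: the function name group_paragraphs is hypothetical, it uses the built-in "html.parser" rather than lxml, and it iterates over next_siblings with an explicit Tag check instead of the while loop, which is a stylistic variation and not the answer's original code.

```python
from collections import defaultdict

from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should also be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
"""


def group_paragraphs(markup):
    """Map each div's text to the list of p texts that follow it.

    Hypothetical helper: collects <p> siblings after each <div> and
    stops at the next <div>.
    """
    soup = BeautifulSoup(markup, "html.parser")
    result = defaultdict(list)
    for div in soup.find_all("div"):
        for sibling in div.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace text nodes between tags
            if sibling.name == "div":
                break  # the next header starts a new key
            if sibling.name == "p":
                result[div.get_text(strip=True)].append(sibling.get_text(strip=True))
    return dict(result)


grouped = group_paragraphs(html)
print(grouped)
```

A div with no following paragraph does not appear in the result at all, because a defaultdict only creates a key on first access; initialise the key explicitly inside the outer loop if empty lists are wanted.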
