Beautiful Soup parsing inline <div> and <p> into dictionary


Question


I'm working on parsing a pretty nasty site. Basically, there are inline divs (which are 'headers'), and paragraph tags beneath (not IN the divs) that are theoretically 'children'... I would like to convert this to a dictionary. I can't quite figure out the best way to do it. Here is roughly what the site looks like:

<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>

So, theoretically (and incorrectly), the Python code would go something like this...

import bs4 as bs   

dict = {"Key" : "Value"}


soup = bs.BeautifulSoup(source,'lxml')
for item in soup:
    if item.tag == "div":
        dict['key'] = item.text
        if item.tag == "p":
            dict['value'] = item.text

But then somehow, once the next <div> is found, it needs to break and start a new key/value pair. I'm having such a hard time wrapping my head around this... Help!

UPDATE: The suggested solution worked beautifully.

Solution

You can first find all of the divs, then loop through the div list and, for each div, take the text of its next sibling p tag. Add more attribute constraints to the find_all call to make sure it matches exactly the elements you want:

{div.get_text(): div.findNextSibling('p').get_text() for div in soup.find_all("div")}

#{'This should be dict key1': 'This should be the value of key1',
# 'This should be dict key2': 'This should be the value of key2'}
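Putting the comprehension above into a self-contained script (a sketch, assuming the sample HTML from the question and using the built-in html.parser so no lxml install is needed; find_next_sibling is the modern spelling of findNextSibling):

```python
import bs4 as bs

source = """
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
"""

soup = bs.BeautifulSoup(source, "html.parser")

# For each div, pair its text with the text of the first <p> that follows it
result = {div.get_text(): div.find_next_sibling("p").get_text()
          for div in soup.find_all("div")}

print(result)
# {'This should be dict key1': 'This should be the value of key1',
#  'This should be dict key2': 'This should be the value of key2'}
```

Note that this only keeps one p per div; if a div can be followed by several p tags, see the update below.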


Update: if there are multiple p tags following a div, then simply loop through all the divs and collect all the ps up to the next div, adding them as values under the previous key. A defaultdict is used here to simplify the logic a little:

from collections import defaultdict
result = defaultdict(list)

for div in soup.find_all("div"):

    ns = div.nextSibling
    while ns is not None and ns.name != "div":
        if ns.name == "p":
            result[div.text].append(ns.text)
        ns = ns.nextSibling

result
# defaultdict(list,
#             {'This should be dict key1': ['This should be the value of key1',
#              'This should also be the value of key1'],
#              'This should be dict key2': ['This should be the value of key2']})


HTML used:

<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should also be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
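For completeness, here is the multi-p version as a runnable script (a sketch, using the HTML above and the built-in html.parser; next_sibling is the modern spelling of nextSibling, and it also yields the whitespace text nodes between tags, which the ns.name == "p" check skips over):

```python
from collections import defaultdict

import bs4 as bs

source = """
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should also be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
"""

soup = bs.BeautifulSoup(source, "html.parser")

result = defaultdict(list)
for div in soup.find_all("div"):
    # Walk forward through the siblings until the next div (or end of document)
    ns = div.next_sibling
    while ns is not None and ns.name != "div":
        if ns.name == "p":
            result[div.get_text()].append(ns.get_text())
        ns = ns.next_sibling

print(dict(result))
# {'This should be dict key1': ['This should be the value of key1',
#                               'This should also be the value of key1'],
#  'This should be dict key2': ['This should be the value of key2']}
```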
