Beautiful Soup parsing inline <div> and <p> into dictionary
Question
I'm working on parsing a pretty nasty site. Basically, there are inline divs (which act as 'headers') with paragraph tags beneath them (not IN the divs) that are theoretically their 'children'. I would like to convert this to a dictionary, but I can't quite figure out the best way to do it. Here is roughly what the site looks like:
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
So, theoretically (and incorrectly), the Python code would go something like this:
import bs4 as bs
dict = {"Key" : "Value"}
soup = bs.BeautifulSoup(source,'lxml')
for item in soup:
    if item.tag == "div":
        dict['key'] = item.text
    if item.tag == "p":
        dict['value'] = item.text
But then somehow, once the next <div> is found, it needs to break and start a new key/value pair. I'm having such a hard time wrapping my head around this... Help!
UPDATE: The suggested solution worked beautifully.
Recommended answer
You can first find all the divs, then loop through that list and, for each div, take the text of its next sibling p tag. Add more attribute constraints to the find_all call if you need to make sure it matches only the divs you want:
{div.get_text(): div.find_next_sibling('p').get_text() for div in soup.find_all("div")}
# {'This should be dict key1': 'This should be the value of key1',
#  'This should be dict key2': 'This should be the value of key2'}
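The one-liner raises an AttributeError if some div has no following p, because find_next_sibling returns None in that case. Below is a minimal, self-contained sketch of the same pairing with a guard for that situation. The helper name pair_divs_with_paragraphs and the use of the built-in "html.parser" (instead of 'lxml') are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
"""

# "html.parser" ships with Python; the question used 'lxml', which also works.
soup = BeautifulSoup(html, "html.parser")


def pair_divs_with_paragraphs(soup, **div_filters):
    """Map each div's text to the text of its next sibling <p>.

    Hypothetical helper: extra keyword arguments are forwarded to
    find_all() so the div selection can be narrowed.
    """
    pairs = {}
    for div in soup.find_all("div", **div_filters):
        p = div.find_next_sibling("p")
        if p is None:
            # No <p> follows this div, so there is no value to record.
            continue
        pairs[div.get_text(strip=True)] = p.get_text(strip=True)
    return pairs


result = pair_divs_with_paragraphs(soup)
print(result)
```

If the header divs carry a distinguishing attribute, it can be passed through, for example pair_divs_with_paragraphs(soup, class_="header"), where "header" is a made-up class name. Note that find_next_sibling('p') returns the first later sibling that is a p, not necessarily the adjacent one, so a div with no paragraph of its own would pick up the paragraph belonging to the next div.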
Update: if multiple p tags follow a div, loop through all the divs, collect every p up to the next div, and append them as values under the current key. A defaultdict simplifies the logic a little:
from collections import defaultdict

result = defaultdict(list)
for div in soup.find_all("div"):
    # Walk the siblings that follow this div until the next div (or the end).
    ns = div.next_sibling
    while ns is not None and ns.name != "div":
        if ns.name == "p":
            result[div.text].append(ns.text)
        ns = ns.next_sibling
result
# defaultdict(list,
#             {'This should be dict key1': ['This should be the value of key1',
#                                           'This should also be the value of key1'],
#              'This should be dict key2': ['This should be the value of key2']})
HTML used:
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should also be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
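The single loop the question was reaching for (start a new key whenever a div appears, attach each p to the most recent key) can also be made to work. The sketch below is one possible way to do it, not the answer's method: it asks find_all for both tag names at once, which yields them in document order, and tracks the current key. The function name group_paragraphs and the "html.parser" choice are assumptions, and the approach assumes the flat layout shown above, with no p nested inside a header div.

```python
from bs4 import BeautifulSoup

html = """
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should also be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
"""

soup = BeautifulSoup(html, "html.parser")


def group_paragraphs(soup):
    """Group <p> texts under the text of the most recent preceding <div>."""
    grouped = {}
    current_key = None
    # Passing a list of names returns matching tags in document order.
    for tag in soup.find_all(["div", "p"]):
        if tag.name == "div":
            # A new header starts a new key.
            current_key = tag.get_text(strip=True)
            grouped.setdefault(current_key, [])
        elif current_key is not None:
            # A paragraph belongs to the last header seen.
            grouped[current_key].append(tag.get_text(strip=True))
    return grouped


grouped = group_paragraphs(soup)
print(grouped)
```

Paragraphs that appear before the first div are skipped, since there is no key to attach them to yet.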