Beautiful Soup parsing inline <div> and <p> into dictionary
I'm working on parsing a pretty nasty site. Basically, there are inline divs (which are "headers") and paragraph tags beneath them (not IN the divs) that are theoretically their "children". I would like to convert this to a dictionary, but I can't quite figure out the best way to do it. Here is roughly what the site looks like:
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
So, theoretically (and incorrectly), the python code would go something like this...
import bs4 as bs
dict = {"Key" : "Value"}
soup = bs.BeautifulSoup(source,'lxml')
for item in soup:
    if item.tag == "div":
        dict['key'] = item.text
    if item.tag == "p":
        dict['value'] = item.text
But then somehow, once the next <div> is found, it needs to break and start a new key/value pair. I'm having such a hard time wrapping my head around this... Help!
UPDATE: The suggested solution worked beautifully.
You can first find all the divs, then loop through that list and, for each div, take the text of its next sibling p tag. If needed, add more attribute constraints to the find_all call to make sure it matches only the divs you want:
{div.get_text(): div.find_next_sibling('p').get_text() for div in soup.find_all("div")}
#{'This should be dict key1': 'This should be the value of key1',
# 'This should be dict key2': 'This should be the value of key2'}
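For anyone who wants to run this end to end, here is a minimal self-contained sketch. The class="header" attribute, the extra "footer" div, and the pairs_from_headers helper are hypothetical additions (not from the original page), there only to show how an attribute constraint narrows find_all. It uses the built-in html.parser instead of lxml so nothing beyond bs4 is required. It also skips any header whose next tag is not a <p>, because searching for the next sibling p on its own would reach past the following <div> and borrow that header's paragraph.

```python
from bs4 import BeautifulSoup, Tag

# Sample markup; the class attributes are an assumption for illustration.
source = """
<div class="header"><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<div class="footer">Unrelated div</div>
<div class="header"><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
<div class="header"><span>Header without a paragraph</span></div>
"""

def pairs_from_headers(html):
    """Map each header div's text to the text of the <p> right after it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # class_="header" is the extra attribute constraint on find_all.
    for div in soup.find_all("div", class_="header"):
        # First following sibling that is a tag (whitespace strings are skipped).
        nxt = next((s for s in div.next_siblings if isinstance(s, Tag)), None)
        # Only pair the header with a paragraph that directly follows it.
        if nxt is not None and nxt.name == "p":
            result[div.get_text(strip=True)] = nxt.get_text(strip=True)
    return result

pairs = pairs_from_headers(source)
print(pairs)
```

Here the "footer" div is never visited, and the last header is left out of the result because no paragraph follows it.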
Update: if there are multiple p tags following a div, loop through all the divs and, for each one, collect every p up to the next div, adding them as values under that div's key. A defaultdict is used here to simplify the logic a little:
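from collections import defaultdict
result = defaultdict(list)
for div in soup.find_all("div"):
    # Walk the siblings after this div until the next div (or the end)
    ns = div.next_sibling
    while ns is not None and ns.name != "div":
        # Keep only <p> tags; text nodes between tags are ignored
        if ns.name == "p":
            result[div.text].append(ns.text)
        ns = ns.next_sibling
result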
# defaultdict(list,
# {'This should be dict key1': ['This should be the value of key1',
# 'This should also be the value of key1'],
# 'This should be dict key2': ['This should be the value of key2']})
HTML used:
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should also be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
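A single-pass variant is also possible, closer to the loop sketched in the question: walk the div and p tags in document order and start a new key each time a div appears. This is only a sketch under the assumption that the headers and paragraphs are flat siblings as in the HTML above; find_all searches recursively, so nested div or p tags would be picked up as well. The group_paragraphs name is made up for this example, and html.parser is used so that only bs4 is needed.

```python
from collections import defaultdict
from bs4 import BeautifulSoup

html_used = """
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should also be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
"""

def group_paragraphs(html):
    """Group the text of each <p> under the most recent <div> before it."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = defaultdict(list)
    current_key = None
    # Passing a list to find_all returns matching tags in document order.
    for tag in soup.find_all(["div", "p"]):
        if tag.name == "div":
            # A new header starts a new key.
            current_key = tag.get_text(strip=True)
        elif current_key is not None:
            # A paragraph belongs to the header seen most recently.
            grouped[current_key].append(tag.get_text(strip=True))
    return dict(grouped)

grouped = group_paragraphs(html_used)
print(grouped)
```

Paragraphs that appear before the first div are ignored, since there is no key to attach them to.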