Beautiful Soup parsing multiple <div> and successive <p> tags into dictionary
Problem description
I have multiple inline divs (which are 'headers') with paragraph tags beneath them (not IN the divs) that are theoretically 'children'. I would like to convert this to a dictionary, but I can't quite figure out the best way to do it. Here is roughly what the site looks like:
<div><span>This should be dict key1</span></div>
<p>This should be the value of key1</p>
<p>This should be the value of key1</p>
<div><span>This should be dict key2</span></div>
<p>This should be the value of key2</p>
The Python code I have working looks like this:
soup = bs.BeautifulSoup(source, 'lxml')
full_discussion = soup.find(attrs={'class': 'field field-type-text field-field-discussion'})
ava_discussion = full_discussion.find(attrs={'class': 'field-item odd'})
for div in ava_discussion.find_all("div"):
    discussion = []
    if div.findNextSibling('p'):
        discussion.append(div.findNextSibling('p').get_text())
    location = div.get_text()
    ava_dict.update({location: {"discussion": discussion}})
However, the problem is that this code only adds the FIRST <p> tag, then it moves on to the next div. Ultimately, I think I'd like to add each <p> to the discussion list. Help!
Update:
Adding a while loop gives me duplicates of the first <p> tag, one for each <p> that exists. Here is the code:
for div in ava_discussion.find_all("div"):
    ns = div.nextSibling
    discussion = []
    while ns is not None and ns.name != "div":
        if ns.name == "p":
            discussion.append(div.findNextSibling('p').get_text())
        ns = ns.nextSibling
    location = div.get_text()
    ava_dict.update({location: {"discussion": discussion}})
print(json.dumps(ava_dict, indent=2))
Recommended answer
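I wasn't adding the correct text. Inside the loop, div.findNextSibling('p') always searches from the div, so it returns the first <p> every time; the text has to come from the current sibling, ns. This code works: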
for div in ava_discussion.find_all("div"):
    ns = div.nextSibling
    discussion = []
    while ns is not None and ns.name != "div":
        if ns.name == "p":
            discussion.append(ns.get_text())
        ns = ns.nextSibling
    location = div.get_text()
    ava_dict.update({location: {"discussion": discussion}})
print(json.dumps(ava_dict, indent=2))
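For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It is not the original page: the sample markup is adapted from the question, the wrapping div with class "field-item odd" is assumed, the helper name group_paragraphs is hypothetical, and it uses the built-in html.parser instead of lxml so nothing beyond bs4 is needed. It walks find_next_siblings() rather than nextSibling, which skips the whitespace text nodes between tags.

```python
import json

from bs4 import BeautifulSoup

# Sample markup adapted from the question. The outer wrapper div is an
# assumption standing in for the real page's "field-item odd" container.
html = """
<div class="field-item odd">
<div><span>This should be dict key1</span></div>
<p>First value of key1</p>
<p>Second value of key1</p>
<div><span>This should be dict key2</span></div>
<p>Only value of key2</p>
</div>
"""


def group_paragraphs(container):
    """Map each header <div> to the <p> texts that follow it (hypothetical helper)."""
    result = {}
    # recursive=False keeps the search to direct children of the container.
    for div in container.find_all("div", recursive=False):
        discussion = []
        # With no arguments, find_next_siblings() returns only the tags that
        # follow this div, so whitespace strings are skipped.
        for sibling in div.find_next_siblings():
            if sibling.name == "div":
                break  # the next header starts here
            if sibling.name == "p":
                discussion.append(sibling.get_text(strip=True))
        result[div.get_text(strip=True)] = {"discussion": discussion}
    return result


soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="field-item odd")
ava_dict = group_paragraphs(container)
print(json.dumps(ava_dict, indent=2))
```

With this sample, key1 ends up with two entries in its discussion list and key2 with one. On the real page the header divs may be nested differently, so the recursive=False restriction might need to be dropped.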