Parsing HTML tags with BeautifulSoup
Question
I need to parse an HTML file using BeautifulSoup. The HTML looks like this:
<div class="entry_container">
<div class="entry lang_en-gb" id="turn-over_1">
<span class="inline">
<h1 class="hwd">turn over</h1>
</span>
<div class="hom" id="turn-over_1.1">
<span class="gramGrp"><span class="pos">intransitive verb</span></span>
<div class="sense"><span class="bold">1 </span><span class="gramGrp"><span class="colloc"><span>[</span>person<span>]</span></span></span><span class="lbl"><span> (</span>in bed<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span class="cit" id="turn-over_1.2"><span>; </span></span></div>
<div class="sense"><span> <br/></span><span class="bold">2 </span><span class="gramGrp"><span class="colloc"><span>[</span>car<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span>, </span><span class="cit lang_fr"><span class="quote">faire un tonneau</span></span><span class="cit" id="turn-over_1.3"><span>; </span></span></div>
<div class="sense"><span> <br/></span><span class="bold">3 </span><span class="lbl"><span>(= </span>switch TV channels<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">changer de chaîne</span></span><span class="cit" id="turn-over_1.4"><span>; </span></span></div>
</div>
<div class="hom" id="turn-over_1.5">
<span> <br/>▶ </span><span class="gramGrp"><span class="pos">transitive verb</span></span>
<div class="sense">
<span class="bold">1 </span>
<div class="sense"><span class="bold"> a </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>object<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">retourner</span></span><span class="cit" id="turn-over_1.6"><span>; </span></span></div>
<div class="sense"><span class="bold"> b </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>page<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">tourner</span></span></div>
<div class="sense"><span class="bold"> c </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>tape<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">changer de face</span></span><span class="cit" id="turn-over_1.7"><span>; </span></span></div>
</div>
<div class="sense"><span> <br/></span><span class="bold">2 </span><span class="lbl"><span>(= </span>hand over<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">remettre</span></span><span class="cit" id="turn-over_1.8"><span>; </span></span><span class="cit" id="turn-over_1.9"><span>; </span></span></div>
</div>
</div>
</div>
For each div class="hom", I need to retrieve the pos (span class="pos") and the senses (each <div class="sense">).
The results of parsing may look like this:
For now, I have tried this code:
for gramGrp in entryContentHTML.find_all('div', attrs={"class": u"hom"}):
    for pos in gramGrp.find('span', attrs={"class": u"gramGrp"}).find('span', attrs={"class": u"pos"}):
        print(pos)
But the output is:
intransitive verb
intransitive verb
transitive verb
Recommended answer
You will have to tidy the output, but this will get what you need:
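from bs4 import BeautifulSoup
# `html` is assumed to hold the HTML from the question as a string
soup = BeautifulSoup(html, "html.parser")
# For each div.hom: strip every line of its text and drop the ";" separators
res = ["\n".join(s.strip() for s in x.text.splitlines()).replace(";", "") for x in soup.find_all("div", {"class": "hom"})]
print("\n".join(res))
Output: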
intransitive verb
1 [person] (in bed) se retourner
2 [car] se retourner, faire un tonneau
3 (= switch TV channels) changer de chaîne
▶ transitive verb
1
a [+ object] retourner
b [+ page] tourner
c [+ tape] changer de face
2 (= hand over) remettre
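If structured data is preferable to flat text, the same traversal can collect the pos and the senses of each div class="hom" into dictionaries. The following is only a sketch under assumptions: the function name parse_homs, the shortened SAMPLE_HTML, and the "label"/"translations" keys are made up for illustration and do not come from the original answer. It also assumes that a sense containing nested senses is only a wrapper and can be skipped.

```python
from bs4 import BeautifulSoup

# Shortened sample modeled on the HTML in the question (hypothetical, for illustration).
SAMPLE_HTML = """
<div class="hom" id="turn-over_1.1">
<span class="gramGrp"><span class="pos">intransitive verb</span></span>
<div class="sense"><span class="bold">1 </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span class="cit"><span>; </span></span></div>
<div class="sense"><span class="bold">2 </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span>, </span><span class="cit lang_fr"><span class="quote">faire un tonneau</span></span></div>
</div>
<div class="hom" id="turn-over_1.5">
<span class="gramGrp"><span class="pos">transitive verb</span></span>
<div class="sense">
<span class="bold">1 </span>
<div class="sense"><span class="bold"> a </span><span class="cit lang_fr"><span class="quote">retourner</span></span></div>
<div class="sense"><span class="bold"> b </span><span class="cit lang_fr"><span class="quote">tourner</span></span></div>
</div>
</div>
"""


def parse_homs(html):
    """Return one dict per div.hom with its pos and its leaf senses."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for hom in soup.find_all("div", class_="hom"):
        # The part of speech lives in span.pos; it may be absent.
        pos_tag = hom.find("span", class_="pos")
        pos = pos_tag.get_text(strip=True) if pos_tag else None
        senses = []
        for sense in hom.find_all("div", class_="sense"):
            # Assumption: a sense that contains nested senses is just a wrapper.
            if sense.find("div", class_="sense"):
                continue
            label = sense.find("span", class_="bold")
            quotes = [q.get_text(strip=True)
                      for q in sense.find_all("span", class_="quote")]
            senses.append({
                "label": label.get_text(strip=True) if label else "",
                "translations": quotes,
            })
        results.append({"pos": pos, "senses": senses})
    return results


if __name__ == "__main__":
    for hom in parse_homs(SAMPLE_HTML):
        print(hom["pos"])
        for sense in hom["senses"]:
            print("  ", sense["label"], ", ".join(sense["translations"]))
```

Skipping the wrapper sense drops its own number (the "1" above "a", "b", "c"), so this flattening would need adjusting if the nesting level matters.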