Get text outside a known element with BeautifulSoup
Problem description
I want to scrape a webpage without using regex at all. I am using BeautifulSoup to handle the scraping. I have this source:
<TD WIDTH="50%" VALIGN="TOP"><span class="sections">Date:</span>
13 August 2014
<br> <br><span class="sections">Application Deadline:</span>
<font color="maroon">
28 August 2014</font>
<font color="#990066">Application closed / under review</font>
<br> <br><span class="sections">Duty Station: </span>
Multiple duty stations
<br>
From this source, I want to scrape 13 August 2014.
I can find the span elements by searching on their class with soup.findAll('span', {'class': 'sections'}), take the first one, and check whether its text is "Date:", but that only gives me the span element itself. The text I'm trying to get comes right after it, and the only other thing I can think of is searching through the td, but that's not what I want, because there are a lot of elements and text inside one td.
I know that I could do it using regex, but I'm really trying to do it with BeautifulSoup alone.
Thanks in advance
Found it.
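Once I have the element <span class="sections">Date:</span>, I just have to use element.nextSibling, which returns the text node that follows the span (in bs4 the same property is also spelled next_sibling).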
Easier than I thought.
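For reference, here is a minimal sketch of the sibling approach applied to the snippet from the question. It assumes the bs4 package with the built-in "html.parser" backend. The helper text_after_label is a hypothetical name introduced only for this illustration; it is not part of BeautifulSoup. It covers the case where the value is wrapped in further tags (as with "Application Deadline:") by walking next_siblings until the next br or span.

```python
from bs4 import BeautifulSoup, Tag

# The fragment from the question, unchanged.
HTML = """<TD WIDTH="50%" VALIGN="TOP"><span class="sections">Date:</span>
13 August 2014
<br> <br><span class="sections">Application Deadline:</span>
<font color="maroon">
28 August 2014</font>
<font color="#990066">Application closed / under review</font>
<br> <br><span class="sections">Duty Station: </span>
Multiple duty stations
<br>"""


def text_after_label(soup, label):
    """Hypothetical helper: return the text following the span whose text is `label`.

    Collects sibling nodes after the span until the next <br> or <span>.
    Returns None if no span with that label exists.
    """
    for span in soup.find_all("span", class_="sections"):
        if span.get_text(strip=True) != label:
            continue
        parts = []
        for node in span.next_siblings:
            # A <br> or the next label span marks the end of this field.
            if isinstance(node, Tag) and node.name in ("br", "span"):
                break
            # Tags (e.g. <font>) contribute their inner text; strings are used directly.
            text = node.get_text(strip=True) if isinstance(node, Tag) else node.strip()
            if text:
                parts.append(text)
        return " ".join(parts)
    return None


soup = BeautifulSoup(HTML, "html.parser")

# Direct version of the answer: the text node right after the first span.
date_span = soup.find("span", class_="sections")
print(date_span.next_sibling.strip())

# Same result through the helper, which also handles nested tags.
print(text_after_label(soup, "Date:"))
print(text_after_label(soup, "Application Deadline:"))
```

The plain next_sibling lookup is enough for "Date:" because the value is a bare text node directly after the span. For "Application Deadline:" the immediate sibling is only whitespace, which is why the sketch iterates over next_siblings instead.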