How to select a div by text content using Beautiful Soup?


Question

I'm trying to scrape some HTML that looks like the sample below. Sometimes the data I need is in div[0], sometimes in div[1], and so on.

Imagine everyone takes 3-5 classes. One of them is always Biology. Their report card is always alphabetized. I want everybody's Biology grade.

I've already scraped all this HTML into a text string. Now, how do I fish out the Biology grades?

<div class = "student">
    <div class = "score">Algebra C-</div>
    <div class = "score">Biology A+</div>
    <div class = "score">Chemistry B</div>
</div>
<div class = "student">
    <div class = "score">Biology B</div>
    <div class = "score">Chemistry A</div>
</div>
<div class = "student">
    <div class = "score">Alchemy D</div>
    <div class = "score">Algebra A</div>
    <div class = "score">Biology B</div>
</div>
<div class = "student">
    <div class = "score">Algebra A</div>
    <div class = "score">Biology B</div>
    <div class = "score">Chemistry C+</div>
</div>
<div class = "student">
    <div class = "score">Alchemy D</div>
    <div class = "score">Algebra A</div>
    <div class = "score">Bangladeshi History C</div>
    <div class = "score">Biology B</div>
</div>

I'm using Beautiful Soup, and I think I'm going to have to find the divs whose text includes "Biology"?

This is only for a quick scrape, and I'm open to hard-coding and fiddling in Excel or whatnot. Yes, it's a shoddy website! Yes, they do have an API, but I don't know a thing about WSDL.

Short version: I'm scraping http://www.legis.ga.gov/Legislation/en-US/Search.aspx to find the date of last action on every bill, FWIW. It's troublesome because if a bill has no sponsors in the second chamber, the page doesn't include an empty div there; it omits the div altogether. So sometimes the timeline is in div 3, sometimes in div 2, etc.

Recommended answer

(1) To get just the Biology grades, it is almost a one-liner.

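import re
import bs4

# html is the scraped markup from the question, held in a string
soup = bs4.BeautifulSoup(html, 'html.parser')
# Find every text node that contains "Biology"
score_strings = soup.find_all(string=re.compile('Biology'))
# The grade is the last whitespace-separated token, e.g. "A+"
scores = [score_string.split()[-1] for score_string in score_strings]
print(score_strings)
print(scores)

The output looks like this:

['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
['A+', 'B', 'B', 'B', 'B']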
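If it matters which student each grade belongs to, a per-student loop may be easier to work with than a flat list. The following is only a sketch under a few assumptions: the function name biology_grades is made up for illustration, the sample markup is an abbreviated version of the question's HTML, and the third student (with no Biology entry) is hypothetical, added to show how a missing div could be handled.

```python
import re
from bs4 import BeautifulSoup

# Abbreviated sample; the last student is a hypothetical case with no Biology row.
SAMPLE = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
    <div class="score">Chemistry B</div>
</div>
<div class="student">
    <div class="score">Biology B</div>
    <div class="score">Chemistry A</div>
</div>
<div class="student">
    <div class="score">Algebra A</div>
</div>
"""

def biology_grades(markup):
    """Return one grade per student div, or None where Biology is absent."""
    soup = BeautifulSoup(markup, "html.parser")
    pattern = re.compile(r"^Biology\b")
    grades = []
    for student in soup.find_all("div", class_="student"):
        # Select the score div by its text, not by its position.
        div = student.find("div", class_="score", string=pattern)
        grades.append(div.get_text().split()[-1] if div else None)
    return grades

print(biology_grades(SAMPLE))
```

Because the lookup is by text, it does not matter whether Biology is the first, second, or fourth div for a given student.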

(2) If you need the tags themselves, perhaps for further tasks, take the parent of each matched string:

import re
import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
# Find the matching text nodes, then step up to the enclosing tags
scores = soup.find_all(string=re.compile('Biology'))
divs = [score.parent for score in scores]
print(divs)

The output looks like this:

[<div class="score">Biology A+</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>,
<div class="score">Biology B</div>]

In sum, you can use methods and attributes such as find_next_siblings(), find_parent(), and .parent to move around the HTML tree.
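As a rough illustration of that kind of navigation, the sketch below starts from each matched Biology string, steps up to its div and to the enclosing student div, and looks at the sibling divs on either side. The sample markup is abbreviated from the question, and the variable names (rows, position, later_courses) are made up for this example.

```python
import re
from bs4 import BeautifulSoup

SAMPLE = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
    <div class="score">Chemistry B</div>
</div>
<div class="student">
    <div class="score">Biology B</div>
    <div class="score">Chemistry A</div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
rows = []
for text_node in soup.find_all(string=re.compile("Biology")):
    score_div = text_node.parent  # the <div class="score"> tag
    student_div = score_div.find_parent("div", class_="student")
    # Index of the Biology div among its sibling divs (0-based).
    position = len(score_div.find_previous_siblings("div"))
    # Total number of score divs this student has.
    total = len(student_div.find_all("div", class_="score"))
    # Text of the score divs that come after Biology.
    later_courses = [sib.get_text() for sib in score_div.find_next_siblings("div")]
    rows.append((position, total, later_courses))

for row in rows:
    print(row)
```

Here position shows that the Biology div really does move around from student to student, which is why matching on text is safer than indexing.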

The Beautiful Soup documentation has more information about how to navigate the tree. Good luck with your work.
