How to select div by text content using Beautiful Soup?


Question

Trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes div[1], etc.

Imagine everyone takes 3-5 classes. One of them is always Biology. Their report card is always alphabetized. I want everybody's Biology grade.

I've already scraped all this HTML into a string. Now, how do I fish out the Biology grades?

<div class = "student">
    <div class = "score">Algebra C-</div>
    <div class = "score">Biology A+</div>
    <div class = "score">Chemistry B</div>
</div>
<div class = "student">
    <div class = "score">Biology B</div>
    <div class = "score">Chemistry A</div>
</div>
<div class = "student">
    <div class = "score">Alchemy D</div>
    <div class = "score">Algebra A</div>
    <div class = "score">Biology B</div>
</div>
<div class = "student">
    <div class = "score">Algebra A</div>
    <div class = "score">Biology B</div>
    <div class = "score">Chemistry C+</div>
</div>
<div class = "student">
    <div class = "score">Alchemy D</div>
    <div class = "score">Algebra A</div>
    <div class = "score">Bangladeshi History C</div>
    <div class = "score">Biology B</div>
</div>

I'm using Beautiful Soup, and I think I'm going to have to find divs whose text includes "Biology"?

This is only for a quick scrape, and I'm open to hard-coding and fiddling in Excel or whatnot. Yes, it's a shoddy website! Yes, they do have an API, but I don't know a thing about WSDL.

Short version: http://www.legis.ga.gov/Legislation/en-US/Search.aspx, to find the date of last action on every bill, FWIW. It's troublesome because if a bill has no sponsors in the second chamber, instead of a div containing nothing, they just don't have a div there at all. So sometimes the timeline is in div 3, sometimes 2, etc.

Solution

(1) To get just the Biology grades, it is almost a one-liner.

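import re
import bs4

# `html` holds the scraped markup shown in the question.
soup = bs4.BeautifulSoup(html, 'html.parser')
# Every text node containing "Biology" (string= was called text= before bs4 4.4.0).
score_strings = soup.find_all(string=re.compile('Biology'))
# The grade is the last whitespace-separated token.
scores = [score_string.split()[-1] for score_string in score_strings]
print(score_strings)
print(scores)

The output looks like this:

['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
['A+', 'B', 'B', 'B', 'B']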
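
For a version that can be run on its own, here is a minimal sketch. It assumes a shortened copy of the question's markup (the `SAMPLE_HTML` constant and the `biology_grades` helper are made up for illustration) and anchors the regular expression so that only strings starting with "Biology" match, capturing the grade directly.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question.
SAMPLE_HTML = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
</div>
<div class="student">
    <div class="score">Biology B</div>
    <div class="score">Chemistry A</div>
</div>
"""

# Course name at the start of the string, grade captured as the final token.
BIOLOGY = re.compile(r'^\s*Biology\s+(\S+)\s*$')


def biology_grades(html):
    """Return the Biology grade from every matching text node, in document order."""
    soup = BeautifulSoup(html, 'html.parser')
    grades = []
    for text in soup.find_all(string=BIOLOGY):
        # Re-apply the same pattern to pull out the captured grade.
        grades.append(BIOLOGY.match(text).group(1))
    return grades


print(biology_grades(SAMPLE_HTML))
```

Anchoring the pattern is a design choice, not a requirement: it avoids picking up a hypothetical course such as "Marine Biology", which a bare `re.compile('Biology')` would also match.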

(2) If you need the tags themselves for further work, take the parent of each matched string:

import re
import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
scores = soup.find_all(string=re.compile('Biology'))
# .parent of each matched text node is its enclosing <div class="score">.
divs = [score.parent for score in scores]
print(divs)

Output looks like this:

[<div class="score">Biology A+</div>, 
<div class="score">Biology B</div>, 
<div class="score">Biology B</div>, 
<div class="score">Biology B</div>, 
<div class="score">Biology B</div>]
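
Going one level further up answers the "sometimes div[0], sometimes div[1]" part of the question. The following is only a sketch under the same assumptions as the question's markup (`SAMPLE_HTML` and `rows` are illustrative names): it walks from each Biology div to its enclosing student div and records the position the Biology div occupies there.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question.
SAMPLE_HTML = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
</div>
<div class="student">
    <div class="score">Biology B</div>
    <div class="score">Chemistry A</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = []
for text in soup.find_all(string=re.compile('Biology')):
    score_div = text.parent  # the <div class="score"> holding the text
    student_div = score_div.find_parent('div', class_='student')
    siblings = student_div.find_all('div', class_='score')
    # Compare by identity so that identical-looking divs are not confused.
    position = next(i for i, div in enumerate(siblings) if div is score_div)
    rows.append((position, score_div.get_text(strip=True)))

print(rows)
```

Because the lookup is driven by the text rather than by a fixed index, it does not matter how many other score divs come before the Biology one.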

*In conclusion, you can use find_next_siblings, find_previous_siblings, .parent, and so on to move around the HTML tree.*

More information on how to navigate the tree is in the Beautiful Soup documentation. Good luck with your work.
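
As a small illustration of sibling navigation, here is a sketch that uses Beautiful Soup's `find_previous_sibling` and `find_next_sibling` methods to look at the score divs on either side of each Biology div. The sample markup and the `neighbours` helper are assumptions made for this example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the question.
SAMPLE_HTML = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
    <div class="score">Chemistry B</div>
</div>
<div class="student">
    <div class="score">Biology B</div>
    <div class="score">Chemistry A</div>
</div>
"""


def neighbours(html):
    """For each Biology div, return the text of the previous and next score div."""
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for text in soup.find_all(string=re.compile('Biology')):
        div = text.parent
        # Both calls skip whitespace-only text nodes and return None at the edge.
        prev_div = div.find_previous_sibling('div')
        next_div = div.find_next_sibling('div')
        result.append((
            prev_div.get_text(strip=True) if prev_div else None,
            next_div.get_text(strip=True) if next_div else None,
        ))
    return result


print(neighbours(SAMPLE_HTML))
```

The `None` checks matter here: a student whose first course is Biology has no previous sibling div at all.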
