Filtering BeautifulSoup

Problem description

I am trying to get a list of colleges and their web sites from another web page.

I have gotten the input down to display the HTML for each line that I want, but I am attempting to further format the text. I only want the college name and the link to that college to be displayed. Any ideas?

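Here's my code:

import urllib2
from bs4 import BeautifulSoup  # assumed import: the answer below points to the version 4 docs

url = "http://www.arizona.edu/colleges"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
# Collect every <span class="field-content"> element on the page
universities = soup.findAll('span', {'class': 'field-content'})
for eachuniversity in universities:
    print eachuniversity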

And here's an example of what I get for eachuniversity:

<div class="views-field-title">
  <span class="field-content">
    <a href="/colleges/college-agriculture-life-sciences">
    <h3>College of Agriculture &amp; Life Sciences</h3>
    </a>
  </span>
</div>

Solution

The following will get you what you're looking for. The information used to do this is easily accessible in the BeautifulSoup documentation (version 4 documentation).

for uni in universities:
    link = uni.find("a")
    college_name = link.text
    web_page = link["href"]

For the first uni (your example):

>>> print web_page
/colleges/college-agriculture-life-sciences
>>> print college_name
College of Agriculture &amp; Life Sciences

I'll leave handling relative/absolute links and special HTML characters as an exercise for you.
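
If it helps as a starting point for that exercise, here is a minimal sketch using only the standard library. It assumes Python 3 (in Python 2, `urljoin` lives in the `urlparse` module instead), and the helper names `absolute_link` and `clean_name` are made up for illustration; they are not part of BeautifulSoup.

```python
from html import unescape
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
BASE_URL = "http://www.arizona.edu/colleges"


def absolute_link(href, base=BASE_URL):
    """Resolve a possibly relative href against the page it came from."""
    return urljoin(base, href)


def clean_name(raw_name):
    """Decode HTML entities such as &amp; and collapse stray whitespace."""
    return " ".join(unescape(raw_name).split())


# Hypothetical inputs, copied from the example in the question.
web_page = absolute_link("/colleges/college-agriculture-life-sciences")
college_name = clean_name("\n    College of Agriculture &amp; Life Sciences\n")

print(web_page)
print(college_name)
```

`urljoin` leaves an href alone if it is already absolute, so the same helper should be safe to call on every link.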


How this works

From your most recent question, I gather that you're having trouble extracting tags from the uni object. Your universities variable is a collection of Tag objects. Each Tag gives dictionary-style access to its attributes (as in link["href"]) and has methods such as find for reaching its children. Try reading "Navigating the Parse Tree" to get a better handle on how parsing works with BeautifulSoup.
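
To make the Tag behaviour concrete, here is a small self-contained sketch that parses the snippet from the question instead of fetching the live page. It assumes Python 3 with the `bs4` package installed and the built-in `html.parser` backend; the `results` list and the `None` check are my additions, not something the original code had.

```python
from bs4 import BeautifulSoup

# The HTML fragment from the question, inlined so no network access is needed.
html_doc = """
<div class="views-field-title">
  <span class="field-content">
    <a href="/colleges/college-agriculture-life-sciences">
    <h3>College of Agriculture &amp; Life Sciences</h3>
    </a>
  </span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
universities = soup.find_all("span", {"class": "field-content"})

results = []
for uni in universities:
    link = uni.find("a")  # first <a> tag nested anywhere inside the span
    if link is None:
        continue  # skip spans that do not contain a link
    college_name = link.get_text(strip=True)  # text of the tag and its children
    web_page = link["href"]  # attributes are read like dictionary keys
    results.append((college_name, web_page))

print(results)
```

As far as I know, bs4 decodes entities such as `&amp;` while parsing, so the name collected here should already contain a plain `&`.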
