Filtering BeautifulSoup

Views: 285 | Posted: 2016/8/5 19:20:46 | Tags: python, beautifulsoup


Problem description

I am trying to get a list of colleges and their web sites from another web page.

I have gotten the input down to display the HTML for each line that I want, but I am attempting to further format the text. I only want the college name and the link to that college to be displayed. Any ideas?

Here's my code:

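import urllib2
# Assumed import: the post does not show which BeautifulSoup version is in use.
from bs4 import BeautifulSoup

url = "http://www.arizona.edu/colleges"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
# Collect every <span class="field-content"> element on the page.
universities = soup.findAll('span', {'class': 'field-content'})
for eachuniversity in universities:
    print eachuniversity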

And here's an example of what I get for eachuniversity:

<div class="views-field-title">
  <span class="field-content">
    <a href="/colleges/college-agriculture-life-sciences">
    <h3>College of Agriculture &amp; Life Sciences</h3>
    </a>
  </span>
</div>

Solution

The following will get you what you're looking for. The information used to do this is easily accessible in the BeautifulSoup documentation (version 4 documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/).

for uni in universities:
    link = uni.find("a")
    college_name = link.text
    web_page = link["href"]

For the first uni (your example):

>>> print web_page
/colleges/college-agriculture-life-sciences
>>> print college_name
College of Agriculture &amp; Life Sciences

I'll leave handling relative/absolute links and special HTML characters as an exercise for you.
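
As a starting point for that exercise, here is a minimal sketch using only the standard library. It assumes Python 3 (the code in this post is Python 2, where urljoin lives in the urlparse module instead), and the names page_url, absolute_url and clean_name are made up for illustration.

```python
from html import unescape
from urllib.parse import urljoin

# Values taken from the example in this post.
page_url = "http://www.arizona.edu/colleges"
web_page = "/colleges/college-agriculture-life-sciences"
college_name = "College of Agriculture &amp; Life Sciences"

# Resolve the relative href against the URL of the page it came from.
absolute_url = urljoin(page_url, web_page)

# Turn HTML entities such as &amp; back into plain characters.
clean_name = unescape(college_name)

print(absolute_url)
print(clean_name)
```

Joining against the URL of the page that was fetched, rather than hard-coding the host, should also keep links that are already absolute unchanged.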

How this works

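From your most recent question (http://stackoverflow.com/questions/12024679/how-can-i-use-beautifulsoup-to-search-for-p-tags-inside-of-certain-spans), I gather that you're having trouble extracting tags from the uni object. Your universities variable is a collection of Tag objects. Each Tag lets you look up its attributes like a dictionary (link["href"]) and search its children with find(). Try reading "Navigating the Parse Tree" (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Navigating%20the%20Parse%20Tree) to get a better handle on how parsing works with BeautifulSoup.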
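
To make the Tag behaviour concrete, here is a small self-contained sketch that runs against the HTML snippet from the question instead of the live page. It assumes Python 3 with BeautifulSoup 4 installed (the bs4 package) and the built-in "html.parser" backend. The names html_doc and results are illustrative, and the None check is an added precaution that is not part of the answer above.

```python
from bs4 import BeautifulSoup

# The sample markup from the question, used in place of a network request.
html_doc = """
<div class="views-field-title">
  <span class="field-content">
    <a href="/colleges/college-agriculture-life-sciences">
    <h3>College of Agriculture &amp; Life Sciences</h3>
    </a>
  </span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list-like collection of Tag objects.
universities = soup.find_all("span", {"class": "field-content"})

results = []
for uni in universities:
    # find() searches the children of this Tag and returns the first match.
    link = uni.find("a")
    if link is None:
        continue  # skip spans that contain no link
    # get_text(strip=True) drops the whitespace around the nested <h3> text.
    name = link.get_text(strip=True)
    # Attributes are read dictionary-style; get() returns None if absent.
    href = link.get("href")
    results.append((name, href))

print(results)
```

With BeautifulSoup 4 the text comes back with the entity already decoded, so the name contains a plain "&" here.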
