Filtering BeautifulSoup
I am trying to get a list of colleges and their web sites from another web page.
I have gotten the input down to display the HTML for each line that I want, but I am attempting to further format the text. I only want the college name and the link to that college to be displayed. Any ideas?
Here's my code:
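import urllib2
from bs4 import BeautifulSoup  # assumed; with BeautifulSoup 3 use "from BeautifulSoup import BeautifulSoup"

url = "http://www.arizona.edu/colleges"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
# Collect every <span class="field-content"> element on the page
universities = soup.findAll('span', {'class': 'field-content'})
for eachuniversity in universities:
    print eachuniversity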
And here's an example of what I get for eachuniversity:
<div class="views-field-title">
<span class="field-content">
<a href="/colleges/college-agriculture-life-sciences">
<h3>College of Agriculture & Life Sciences</h3>
</a>
</span>
</div>
The following will get you what you're looking for. The information used to do this is easily accessible in the BeautifulSoup documentation (version 4 documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/).
for uni in universities:
    link = uni.find("a")
    college_name = link.text
    web_page = link["href"]
For the first uni (your example):
>>> print web_page
/colleges/college-agriculture-life-sciences
>>> print college_name
College of Agriculture & Life Sciences
I'll leave handling relative/absolute links and special HTML characters as an exercise for you.
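As a possible starting point for that exercise, here is a minimal sketch that uses only the standard library. It is written for Python 3, whereas the snippets above are Python 2. The names BASE_URL and normalize are hypothetical, and the sketch assumes relative links should be resolved against the page URL from the question.

```python
from html import unescape
from urllib.parse import urljoin

# Assumption: relative links resolve against the page that was scraped.
BASE_URL = "http://www.arizona.edu/colleges"


def normalize(href, raw_name, base_url=BASE_URL):
    """Return (absolute_url, clean_name) for one scraped link.

    Hypothetical helper: urljoin leaves absolute URLs untouched and
    resolves relative ones; unescape decodes entities such as &amp;.
    """
    absolute_url = urljoin(base_url, href)
    clean_name = unescape(raw_name).strip()
    return absolute_url, clean_name


url, name = normalize(
    "/colleges/college-agriculture-life-sciences",
    "\nCollege of Agriculture &amp; Life Sciences\n",
)
print(url)
print(name)
```

BeautifulSoup 4 normally decodes entities when you read .text, so the unescape call only matters if you end up working with raw markup.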
How this works
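From your most recent question (http://stackoverflow.com/questions/12024679/how-can-i-use-beautifulsoup-to-search-for-p-tags-inside-of-certain-spans), I gather that you're having trouble extracting tags from the uni object. Your universities variable is a collection of Tag objects. Each Tag gives dictionary-style access to its attributes (as in link["href"]) and has methods such as find for reaching its children. Try reading "Navigating the Parse Tree" (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Navigating%20the%20Parse%20Tree) to get a better handle on how parsing works with BeautifulSoup.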
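To make the Tag behaviour concrete, here is a small self-contained sketch that parses the sample markup from the question instead of fetching the live page. It assumes Python 3 with the bs4 package installed and uses the built-in "html.parser" backend. The html and results names are only for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question, so no network access is needed.
html = """
<div class="views-field-title">
<span class="field-content">
<a href="/colleges/college-agriculture-life-sciences">
<h3>College of Agriculture &amp; Life Sciences</h3>
</a>
</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
universities = soup.find_all("span", {"class": "field-content"})

results = []
for uni in universities:
    link = uni.find("a")  # first <a> descendant of this Tag, or None
    if link is None:
        continue  # skip spans that do not contain a link
    # get_text(strip=True) drops surrounding whitespace; link["href"] is
    # dictionary-style access to the tag's attribute.
    results.append((link.get_text(strip=True), link["href"]))

print(results)
```

Checking for None before indexing avoids a TypeError on any span that has no link inside it.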