How to extract certain parts of a web page in Python


Problem description

Target web page: http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm

The section I want to extract:

  <tr>
  <td>Skilled &ndash; Independent (Residence) subclass 885<br />online</td>
  <td>N/A</td>
  <td>N/A</td>
  <td>N/A</td>
  <td>15 May 2011</td>
  <td>N/A</td>
  </tr>

Once the code finds this section by searching for the keyword "subclass 885<br />online", it should print the date in the 5th <td> tag, which is "15 May 2011" as shown above.

It's just a monitor for myself to keep an eye on the progress of my immigration application.

Solution

"Beau--ootiful Soo--oop!

Beau--ootiful Soo--oop!

Soo--oop of the e--e--evening,

Beautiful, beauti--FUL SOUP!"

--Lewis Carroll, Alice's Adventures in Wonderland

I think this is exactly what he had in mind!

The Mock Turtle would probably do something like this:

>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> url = 'http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm'
>>> page = urllib2.urlopen(url)
>>> soup = BeautifulSoup(page)
>>> for row in soup.html.body.findAll('tr'):
...     data = row.findAll('td')
...     if data and 'subclass 885online' in data[0].text:
...         print data[4].text
... 
15 May 2011

But I'm not sure it would help, since that date has already passed!

Good luck with the application!
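
The session above uses BeautifulSoup 3 and urllib2, both of which exist only for Python 2. As a rough sketch of the same row-matching idea on Python 3, the following swaps in the standard library's html.parser instead of BeautifulSoup, so it needs no third-party install. It is not the answer's method, only an illustration: the names SAMPLE_HTML, RowCollector and find_date are made up for this example, and the HTML is the row quoted in the question rather than the live page.

```python
from html.parser import HTMLParser

# Assumption: this sample is the row quoted in the question, wrapped in a
# <table>; the live page is not fetched here.
SAMPLE_HTML = """
<table>
<tr>
<td>Skilled &ndash; Independent (Residence) subclass 885<br />online</td>
<td>N/A</td>
<td>N/A</td>
<td>N/A</td>
<td>15 May 2011</td>
<td>N/A</td>
</tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being parsed, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # <br /> contributes no text, so the pieces join without a space.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def find_date(html, keyword="subclass 885online"):
    """Return the 5th cell of the first row whose 1st cell contains keyword."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    for cells in parser.rows:
        if len(cells) >= 5 and keyword in cells[0]:
            return cells[4]
    return None


print(find_date(SAMPLE_HTML))
```

To run this against a real page, the HTML string could be obtained with urllib.request.urlopen(url).read() and decoded to text before being passed to find_date; whether the original URL still serves this table is not something the sketch assumes.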
