Make urllib2 move through pages

Question

I am trying to scrape http://targetstudy.com/school/schools-in-chhattisgarh.html

I am using lxml.html and urllib2.

I want to follow every page through its next-page link and download each page's source, stopping at the last page. The href of the next-page link is ['?recNo=25'].

Could someone please advise how to do that? Thanks in advance.

Here is my code:

    import urllib2
    import lxml.html
    import itertools
    url = "http://targetstudy.com/school/schools-in-chhattisgarh.html"
    req = urllib2.Request(url, headers={ 'User-Agent': 'Mozilla/5.0' })
    stuff = urllib2.urlopen(req).read().encode('ascii', 'ignore')
    tree = lxml.html.fromstring(stuff)
    print stuff

    links = tree.xpath("(//ul[@class='pagination']/li/a)[last()]/@href")
    for link in links:
        req = urllib2.Request(url, headers={ 'User-Agent': 'Mozilla/5.0' })
        stuff = urllib2.urlopen(req).read().encode('ascii', 'ignore')
        tree = lxml.html.fromstring(stuff)
        print stuff
        links = tree.xpath("(//ul[@class='pagination']/li/a)[last()]/@href")

But all it does is go to the 2nd page; it does NOT go any further.

Please help me.

Recommended answer

I expect all your problems come from overwriting your list at the end of the loop. Assuming the rest of your code works, this might be a better solution.

    import urllib2
    import urlparse
    import lxml.html

    url = "http://targetstudy.com/school/schools-in-chhattisgarh.html"

    links = [url]
    visited = []
    while len(links) > 0:
        # take a link out of the list and mark it as visited
        link = links.pop()
        visited.append(link)

        # open the link and read the contents
        # (lxml accepts the raw bytes; in Python 2, calling .encode('ascii', 'ignore')
        # on them raises UnicodeDecodeError as soon as a page contains non-ASCII bytes)
        req = urllib2.Request(link, headers={ 'User-Agent': 'Mozilla/5.0' })
        stuff = urllib2.urlopen(req).read()
        tree = lxml.html.fromstring(stuff)
        print stuff

        # for the last link in the pagination list (the XPath from the question)
        for href in tree.xpath("(//ul[@class='pagination']/li/a)[last()]/@href"):
            # hrefs such as '?recNo=25' are relative: resolve them against the current page
            new_link = urlparse.urljoin(link, href)
            # if link has not been visited yet and is not in the list to visit next
            if new_link not in links and new_link not in visited:
                # add the new link to the list of links to visit
                links.append(new_link)
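
For readers on Python 3, where urllib2 no longer exists (its functionality lives in urllib.request), the same idea can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the answerer's code: the names PaginationLinks, next_page_url, fetch and crawl are hypothetical, html.parser stands in for lxml, and the pagination markup is assumed to be a ul with class "pagination" whose last link leads to the next page, as in the question. Relative hrefs such as '?recNo=25' are resolved with urljoin, and the loop stops as soon as the last link points to a page that was already downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class PaginationLinks(HTMLParser):
    """Collect the href of every <a> inside <ul class="pagination">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside the pagination <ul>
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            classes = (attrs.get("class") or "").split()
            if self.depth or "pagination" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth:
            self.depth -= 1


def next_page_url(html, page_url):
    """Return the last pagination link as an absolute URL, or None."""
    parser = PaginationLinks()
    parser.feed(html)
    if not parser.hrefs:
        return None
    # '?recNo=25' is relative, so resolve it against the current page
    return urljoin(page_url, parser.hrefs[-1])


def fetch(url):
    """Download one page as text (hypothetical helper, needs network access)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    return urlopen(req).read().decode("utf-8", "ignore")


def crawl(start_url, fetcher=fetch):
    """Follow the pagination links from start_url; return {url: html}."""
    pages = {}
    url = start_url
    # stop when there is no link, or when the link leads to a page already seen
    while url is not None and url not in pages:
        pages[url] = fetcher(url)
        url = next_page_url(pages[url], url)
    return pages
```

Calling crawl(url) with the URL from the question would download each page once. Passing a different fetcher, for example the __getitem__ method of a dict that maps URLs to saved HTML, lets the loop be tried out without touching the network.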
