How to get all pages from the whole website using python?


Question

I am trying to make a tool that should get every link from a website. For example, I need to get all question pages from Stack Overflow. I tried using scrapy:

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://stackoverflow.com/questions/']

    def parse(self, response):
        # Extract every link found on the response and print its URL
        le = LinkExtractor()
        for link in le.extract_links(response):
            url_lnk = link.url
            print(url_lnk)

Here I got only the questions from the start page. What do I need to do to get all 'question' links? Time doesn't matter, I just need to understand what to do.

UPD

The site I want to observe is https://sevastopol.su/ - this is a local city news website.

The list of all news should be contained here: https://sevastopol.su/all-news

At the bottom of that page you can see page numbers, but if we go to the last page of news we will see that it is number 765 (right now, 19.06.2019), yet the newest item on it is dated 19 June 2018. So the pagination only reaches back one year. However, there are plenty of older news links that are still alive (probably going back to 2010) and can even be found through this site's search page. That is why I wanted to know whether there is access to some global link store of this site.
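A common form of such a "global link store" is an XML sitemap. Purely as a sketch, and only assuming sevastopol.su exposes one at the conventional /sitemap.xml path (which is not confirmed anywhere in this question), one could check for it like this:

import urllib.request
from xml.etree import ElementTree

# Hypothetical location: many sites list all of their URLs in a sitemap,
# but sevastopol.su may or may not actually publish one at this path.
SITEMAP_URL = "https://sevastopol.su/sitemap.xml"

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ElementTree.fromstring(resp.read())

# Sitemap entries use the standard sitemaps.org namespace.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for loc in tree.findall(".//sm:loc", ns):
    print(loc.text)

If the request returns 404, the site simply does not publish a sitemap there, and crawling the pagination (as in the answer below) remains the fallback.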

Recommended answer

This is something you might want to do to get all the links to the different questions asked. However, I think your script might hit a 404 error somewhere during execution, as there are millions of links to parse.

Run the script as is:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ["https://stackoverflow.com/questions/"]

    def parse(self, response):
        # Yield the absolute URL of every question listed on the current page
        for link in response.css('.summary .question-hyperlink::attr(href)').getall():
            post_link = response.urljoin(link)
            yield {"link": post_link}

        # Follow the "next" pagination link, if any, and parse it the same way
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
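As a usage note, one possible way to run this spider from a plain Python script and save the collected links is sketched below; the output file name, the FEEDS export setting, and the DOWNLOAD_DELAY value are illustrative choices (not part of the original answer) and assume a reasonably recent Scrapy version:

from scrapy.crawler import CrawlerProcess

# Reuses the StackOverflowSpider class defined above.
process = CrawlerProcess(settings={
    "FEEDS": {"question_links.json": {"format": "json"}},  # write yielded items to a file
    "DOWNLOAD_DELAY": 1,  # be polite and reduce the chance of throttling or errors
})
process.crawl(StackOverflowSpider)
process.start()  # blocks until the crawl finishes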

