Scrapy - Recursively Scrape Webpages and Save Content as HTML Files


Problem description

I am using Scrapy to extract the information in the tags of web pages and then save those pages as HTML files. For example, http://www.austlii.edu.au/au/cases/cth/HCA/1945/ lists a number of pages related to judicial cases. I want to follow each link and save only the content related to the particular judicial case as an HTML page, e.g. go to http://www.austlii.edu.au/au/cases/cth/HCA/1945/1.html and save the information related to that case.

Is there a way to do this recursively in Scrapy and save the content as HTML pages?

Answer

Yes, you can do it with Scrapy; Link Extractors will help:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class AustliiSpider(CrawlSpider):
    name = "austlii"
    allowed_domains = ["austlii.edu.au"]
    start_urls = ["http://www.austlii.edu.au/au/cases/cth/HCA/1945/"]
    rules = (
        # Follow every case page such as .../1945/1.html and hand it to parse_item
        # (the dot before "html" is escaped so it matches a literal dot)
        Rule(SgmlLinkExtractor(allow=r"au/cases/cth/HCA/1945/\d+\.html"), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)  # selector for extracting data from the page

        # do whatever with html content (response.body variable)
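The parse_item stub above leaves the actual saving step open. Below is a minimal sketch of one way to write each case page to disk, updated for current Scrapy releases (the scrapy.contrib modules and SgmlLinkExtractor used above were removed in later Scrapy versions in favour of scrapy.spiders and scrapy.linkextractors.LinkExtractor). The cases/ output directory and the use of the last URL segment as the file name are assumptions for illustration, not part of the original answer.

from pathlib import Path

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AustliiSpider(CrawlSpider):
    name = "austlii"
    allowed_domains = ["austlii.edu.au"]
    start_urls = ["http://www.austlii.edu.au/au/cases/cth/HCA/1945/"]
    rules = (
        Rule(LinkExtractor(allow=r"au/cases/cth/HCA/1945/\d+\.html"),
             follow=True, callback="parse_item"),
    )

    def parse_item(self, response):
        # Name each file after the last URL segment, e.g. "1.html"
        # (naming scheme is an assumption, not from the original answer)
        filename = response.url.rstrip("/").split("/")[-1]
        Path("cases").mkdir(exist_ok=True)
        # response.body holds the raw HTML bytes of the case page
        Path("cases", filename).write_bytes(response.body)

You can run this with scrapy runspider on the file containing the spider. If you only want the case-related portion of the page rather than the full HTML, select that fragment first (e.g. with response.xpath(...)) and write the extracted markup instead of response.body.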

Hope this helps.
