How to crawl a local HTML file with Scrapy


Problem description

I tried to crawl a local HTML file stored on my desktop with the code below, but I encounter the following errors before the crawling procedure starts, such as "No such file or directory: '/robots.txt'".

  • Is it possible to crawl a local HTML file on my local machine (Mac)?
  • If so, how should I set parameters such as "allowed_domains" and "start_urls"?
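As a side note on the second question: a `start_urls` entry for a local file must be a `file://` URI, and the standard library can build one from a plain path. A minimal sketch, reusing the path from the question:

```python
from pathlib import PurePosixPath

# Convert an absolute local path into a file:// URI suitable for start_urls
local_file = PurePosixPath("/Users/Name/Desktop/test/test.html")
file_uri = local_file.as_uri()
print(file_uri)  # file:///Users/Name/Desktop/test/test.html
```

`as_uri()` also percent-encodes characters such as spaces, which a hand-built `'file://' + path` string would not.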

[Scrapy command]

$ scrapy crawl test -o test01.csv

[Scrapy spider]

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = []
    start_urls = ['file:///Users/Name/Desktop/test/test.html']

[Error]

2018-11-16 01:57:52 [scrapy.core.engine] INFO: Spider opened
2018-11-16 01:57:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-16 01:57:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-16 01:57:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 1 times): [Errno 2] No such file or directory: '/robots.txt'
2018-11-16 01:57:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 2 times): [Errno 2] No such file or directory: '/robots.txt'

Recommended answer

When working with Scrapy locally, I never specify `allowed_domains`. Try taking that line of code out and see if it works.

In your error log, Scrapy is testing the 'empty' domain that you have given it.
