DEBUG: Crawled (404)

Question

Here is my code:

# -*- coding: utf-8 -*-
import scrapy


class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents=response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)

I have also set a user agent in settings.py.

Then I get this error:

2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance.sina.com.cn/robots.txt> (referer: None)
2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://money.finance.sina.com.cn/mkt//> (referer: None)

So how can I eliminate this error?

Recommended answer

The HTTP status code 404 appears because Scrapy requests /robots.txt before crawling whenever ROBOTSTXT_OBEY is enabled, which is the case in the settings.py generated by scrapy startproject. In your case that file does not exist on the site, so a 404 is returned, but this has no impact on the crawl. If you want to skip the robots.txt check, set ROBOTSTXT_OBEY = False in settings.py.
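
As a minimal sketch, the relevant line in the project's settings.py would look like the following. Only ROBOTSTXT_OBEY comes from the answer; the rest of the file is assumed to stay as generated.

```python
# settings.py (fragment)
# Skip the robots.txt request, so the 404 for /robots.txt no longer shows up in the log.
# Note: with this setting the spider no longer honours the site's robots.txt rules.
ROBOTSTXT_OBEY = False
```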

The page itself is then fetched successfully (HTTP status code 200). print(contents) shows only an empty list because your XPath expression matches nothing in the response. You need to fix the XPath expression.
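
To illustrate what "nothing is selected" means, here is a rough sketch that uses Python's standard-library ElementTree instead of Scrapy's selectors, so it runs without a crawl. Both markup samples and the helper name second_link_class are hypothetical and are not taken from the real page. They only show that the same path yields a value when the element is in the document and nothing when it is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup in which the target element exists.
WITH_TARGET = """
<html><body>
  <div id="list_amount_ctrl">
    <a class="on">20</a>
    <a class="off">40</a>
  </div>
</body></html>
"""

# Hypothetical markup in which the target element is absent,
# for example because the real page fills it in later.
WITHOUT_TARGET = "<html><body><div id='other'/></body></html>"


def second_link_class(markup):
    """Return the class of the second <a> under #list_amount_ctrl, or None."""
    root = ET.fromstring(markup)
    # ElementTree supports only a subset of XPath, so the attribute is read
    # with .get() instead of a trailing /@class step.
    node = root.find(".//*[@id='list_amount_ctrl']/a[2]")
    return None if node is None else node.get("class")


print(second_link_class(WITH_TARGET))
print(second_link_class(WITHOUT_TARGET))
```

If the second case matches what you see, the element is probably not in the HTML that Scrapy downloads, and the expression has to target something that is actually in the response.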

If you want to try out different XPath or CSS selectors to work out how to get the content you want, the interactive Scrapy shell is useful:
scrapy shell "http://money.finance.sina.com.cn/mkt/"

More information is available in the official Scrapy documentation.
