Scrapy: HTTP status code is not handled or not allowed?


Question


I want to get the product title, link, and price in the category https://tiki.vn/dien-thoai-may-tinh-bang/c1789

But it fails with "HTTP status code is not handled or not allowed":

My file: spiders/tiki.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from stackdata.items import StackdataItem


class StackdataSpider(CrawlSpider):
    name = "tiki"
    allowed_domains = ["tiki.vn"]
    start_urls = [
        "https://tiki.vn/dien-thoai-may-tinh-bang/c1789",
    ]

    rules = (
        Rule(LinkExtractor(allow=r"\?page=2"),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        questions = response.xpath('//div[@class="product-item"]')

        for question in questions:
            question_location = question.xpath(
                '//a/@href').extract()[0]
            full_url = response.urljoin(question_location)
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        item = StackdataItem()
        item["title"] = response.css(
            ".item-box h1::text").extract()[0]
        item["url"] = response.url
        item["content"] = response.css(
            ".price span::text").extract()[0]
        yield item

File: items.py

import scrapy


class StackdataItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    price = scrapy.Field()

Please help me!!!! thanks!

Solution

tl;dr

You are being blocked based on scrapy's user-agent.

You have two options:

  1. Grant the wish of the website and do not scrape them, or
  2. Change your user-agent

I assume you want to take option 2.

Go to your settings.py in your scrapy project and set your user-agent to a non-default value. Either your own project name (it probably should not contain the word scrapy) or a standard browser's user-agent.

USER_AGENT='my-cool-project (http://example.com)'

Detailed error analysis

We all want to learn, so here is an explanation of how I got to this result and what you can do if you see such behavior again.

The website tiki.vn seems to return HTTP status 404 for all requests of your spider. You can see in your screenshot that you get a 404 for both your requests to /robots.txt and /dien-thoai-may-tinh-bang/c1789.

404 means "not found", and web servers use it to signal that a URL does not exist. However, if we open the same URLs manually, we can see that both pages contain valid content. Now, it is technically possible for a server to return content together with a 404 status code, but we can check this with the developer console of our browser (e.g. Chrome or Firefox).

In the developer console we can see that robots.txt returns a valid 200 status code.
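The same check can also be done from the command line. Below is a minimal sketch using only the Python standard library; the URL and user-agent strings are just examples, and the site may behave differently today:

```python
import urllib.request
import urllib.error


def build_request(url, user_agent):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent):
    """Return the HTTP status code the server sends for this user-agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent)) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is still what we want to see.
        return err.code


if __name__ == "__main__":
    url = "https://tiki.vn/dien-thoai-may-tinh-bang/c1789"
    for ua in ("Scrapy/1.3.0 (+http://scrapy.org)", "Mozilla/5.0"):
        print(ua, "->", fetch_status(url, ua))
```

If the scrapy-like user-agent gets a 404 while the browser-like one gets a 200, that confirms the block is based on the user-agent rather than on your IP.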

Further investigations to be done

Many web sites try to restrict scraping, so they try to detect scraping behavior. So, they will look at some indicators and decide if they will serve content to you or block your request. I assume that exactly this is what's happening to you.

I wanted to crawl one website, which worked totally fine from my home PC, but did not respond at all (not even 404) to any request from my server (scrapy, wget, curl, ...).

Next steps you'll have to take to analyze the reason for this issue:

  • Can you reach the website from your home PC (and do you get status code 200)?
  • What happens if you run scrapy from your home PC? Still 404?
  • Try to load the website from the server on which you run scrapy (e.g. with wget or curl).

You can fetch it with wget like this:

wget https://tiki.vn/dien-thoai-may-tinh-bang/c1789

wget sends its own default user-agent (Wget/VERSION), so you might want to set it to a web browser's user-agent if this command does not work (it did from my PC).

wget -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' https://tiki.vn/dien-thoai-may-tinh-bang/c1789

This will help you to find out if the problem is with the server (e.g. they blocked the IP or a whole IP range) or if you need to make some modifications to your spider.

Checking the user-agent

If it works with wget for your server, I would suspect the user-agent of scrapy to be the problem. According to the documentation, scrapy does use Scrapy/VERSION (+http://scrapy.org) as the user-agent unless you set it yourself. It's quite possible that they block your spider based on the user-agent.

So, you have to go to settings.py in your scrapy project and look for the setting USER_AGENT there. Now, set it to anything which does not contain the keyword scrapy. If you want to be nice, use your project name + domain; otherwise use a standard browser user-agent.

Nice variant:

USER_AGENT='my-cool-project (http://example.com)'

Not so nice (but common in scraping) variant:

USER_AGENT='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'

In fact, I was able to verify that they block on the user-agent with this wget command from my local PC:

wget -U 'Scrapy/1.3.0 (+http://scrapy.org)' https://tiki.vn/dien-thoai-may-tinh-bang/c1789

which results in

--2017-10-14 18:54:04--  https://tiki.vn/dien-thoai-may-tinh-bang/c1789
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving tiki.vn... 203.162.81.188
Connecting to tiki.vn|203.162.81.188|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-10-14 18:54:06 ERROR 404: Not Found.
