Scrapy returns Crawled (406) error / problem scraping weekly circulars

Question

I wanted to scrape Rite Aid's weekly circular. Through the Network tab of my browser's developer tools, I found a JSON link containing what I needed (brand name, price, discount). Based on past JSON scraping projects, I created this Scrapy spider:

import json

import scrapy


class RiteAidSpider(scrapy.Spider):
    name = 'riteaid'
    start_urls = ['https://weeklyad.info.riteaid.com/flyer_data/3444750?locale=en-US']

    def parse(self, response):
        # The endpoint responds with a JSON array of flyer items.
        data = json.loads(response.body)
        for item in data:
            yield {
                'store': 'rite aid',
                'name': item['display_name'],
                'discount': item['pre_price_text'],
                'sales_price': item['current_price'],
            }

But when I run the program, I get "Scrapy Crawled (406) HTTP status code is not handled or not allowed."

One thing I find weird is that when I enter the start_url in my browser, the JSON doesn't appear. In past scraping projects, whenever I put a JSON link into the browser I could still see the JSON data, but not this time. I don't understand why it won't show up.

Can anyone point me in the right direction and tell me what I'm doing wrong, or what I have to learn to make this work?

Answer

The HTTP 406 Not Acceptable client error response code indicates that the server cannot produce a response matching the list of acceptable values defined in the request's proactive content negotiation headers, and that the server is unwilling to supply a default representation.
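If you want to see what the server actually sends back along with the 406, a minimal sketch (reusing the same spider) is to let Scrapy pass that status through to parse() via handle_httpstatus_list and log the body:

import json

import scrapy


class RiteAidSpider(scrapy.Spider):
    name = 'riteaid'
    # Let 406 responses reach parse() instead of being filtered out,
    # so the body can be inspected for hints from the server.
    handle_httpstatus_list = [406]
    start_urls = ['https://weeklyad.info.riteaid.com/flyer_data/3444750?locale=en-US']

    def parse(self, response):
        if response.status == 406:
            self.logger.warning('406 from %s; body starts: %r',
                                response.url, response.body[:200])
            return
        for item in json.loads(response.body):
            yield item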

It seems the server you're trying to communicate with performs some validation and refuses to return data to web crawlers.

If it worked before and doesn't now, maybe this protection was only recently put in place. If you want to try to get around the problem, you can change some of the 'default' settings in your script.

You can change the settings.py file in your current project: look up that information and change it to match a web browser installed on your machine.

Open the developer tools (F12), go to the Network tab, navigate around the site, and copy the header values from a request to that domain.
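As a quick sanity check before touching Scrapy, you can replay those headers outside the framework; here is a minimal sketch using the requests library (the header values shown are placeholders you would paste from your own browser):

import requests

url = 'https://weeklyad.info.riteaid.com/flyer_data/3444750?locale=en-US'

# Placeholder values -- paste the real ones from your browser's Network tab.
headers = {
    'User-Agent': 'Mozilla/5.0 ...',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.9',
}

resp = requests.get(url, headers=headers)
print(resp.status_code)   # hopefully 200 instead of 406
print(resp.text[:200])    # first part of the JSON body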

settings.py

USER_AGENT = ''  # copy the user-agent from your browser

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': '',           # copy this value from your browser
    'Accept-Language': '',  # copy this value from your browser
}
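If you'd rather not change the project-wide settings, a per-spider alternative (a sketch, reusing the same spider class) is to use custom_settings, or to set the headers on each request:

import scrapy


class RiteAidSpider(scrapy.Spider):
    name = 'riteaid'

    # Per-spider overrides; same keys as settings.py, scoped to this spider.
    custom_settings = {
        'USER_AGENT': '',  # copy the user-agent from your browser
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': '',           # copy this value from your browser
            'Accept-Language': '',  # copy this value from your browser
        },
    }

    def start_requests(self):
        # Headers passed here take precedence over the defaults above.
        url = 'https://weeklyad.info.riteaid.com/flyer_data/3444750?locale=en-US'
        yield scrapy.Request(url, headers={'Accept': 'application/json'})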

Here is a good tutorial (2017) that explains in detail how to handle navigation errors. It's old, but you can get the main idea from it. This tutorial is also linked in the resources section of the Scrapy website.

Hope that helps.
