How to check whether a website supports http, https, and the www prefix with Scrapy


Question

I am using Scrapy to check whether a website works fine when I use http://example.com, https://example.com, or http://www.example.com. When I create a Scrapy request, it works fine; for example, my page1.com is always redirected to https://. I need to get this information as a return value, or is there a better way to get this information using Scrapy?

import scrapy

class MySpider(scrapy.Spider):
    name = 'superspider'

    start_urls = [
        "https://page1.com/"
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        url = response.url
        # removing all possible prefixes from url
        for remove in ['https://', 'http://', 'www.']:
            url = url.replace(remove, '').rstrip('/')

        # Try with all possible prefixes
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(url='{}{}'.format(prefix, url), callback=self.test, dont_filter=True)

    def test(self, response):
        print(response.url, response.status)

The output of this spider is:

https://page1.com 200
https://page1.com/ 200
https://page1.com/ 200
https://page1.com/ 200

This is nice, but I would like to get this information as a return value, so that I know, for example, that http gives response code 200, and then save it to a dictionary for later processing, or save it as JSON to a file (using items in Scrapy).

DESIRED OUTPUT: I would like to have a dictionary named a with all the information:

print(a)
{'https://': True, 'http://': True, 'https://www.': True, 'http://www.': True}

Later I would like to scrape more information, so I will need to store everything under one object/JSON/...

Answer

Instead of using the meta possibility pointed out by eLRuLL, you can parse request.url:

scrapy shell http://stackoverflow.com
In [1]: request.url
Out[1]: 'http://stackoverflow.com'

In [2]: response.url
Out[2]: 'https://stackoverflow.com/'
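
Applied in a spider callback, the originally requested URL can be recovered, for example, from the redirect_urls meta key that Scrapy's RedirectMiddleware fills in. Below is a minimal sketch of such a test callback; the item fields url, urlprefix, and urlstatus are assumed names, chosen to match the pipeline further down:

    def test(self, response):
        # RedirectMiddleware stores the URL chain under 'redirect_urls';
        # its first entry is the URL that was originally requested.
        redirects = response.meta.get('redirect_urls')
        requested = redirects[0] if redirects else response.url
        # Check the longer prefixes first so that 'https://' does not
        # shadow 'https://www.'.
        for prefix in ('https://www.', 'http://www.', 'https://', 'http://'):
            if requested.startswith(prefix):
                yield {
                    'url': requested[len(prefix):].rstrip('/'),  # bare domain
                    'urlprefix': prefix,
                    'urlstatus': response.status,
                }
                break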

To store the values of the different runs together in one dict/JSON, you can use an additional pipeline like the one mentioned in https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter. So you have something like:

class WriteAllRequests(object):
    def __init__(self):
        self.urldic = {}

    def process_item(self, item, spider):
        # collect the status per tested prefix under the bare domain
        self.urldic.setdefault(item['url'], {})[item['urlprefix']] = item['urlstatus']
        if len(self.urldic[item['url']]) == 4:
            # all four prefixes have been seen for this domain;
            # writedata is a placeholder - this could also be passed on
            # to a standard pipeline with a higher order number
            writedata(self.urldic[item['url']])
            del self.urldic[item['url']]
        return item
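
The pipeline above assumes that each item carries url, urlprefix, and urlstatus fields. If you prefer declared Scrapy items over plain dicts, a minimal sketch of such an item (the class name is an assumption):

import scrapy

class PrefixCheckItem(scrapy.Item):
    url = scrapy.Field()        # bare domain, e.g. 'page1.com'
    urlprefix = scrapy.Field()  # one of the four tested prefixes
    urlstatus = scrapy.Field()  # HTTP status code of the response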

You additionally have to activate the pipeline in your project settings, for example:
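
A minimal sketch, assuming the pipeline class lives in pipelines.py of a project named myproject (both names are assumptions):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.WriteAllRequests': 300,
}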
