How to check whether a website supports http, https and www prefixes with Scrapy
Problem description
I am using Scrapy to check whether a website works when accessed as http://example.com, https://example.com or http://www.example.com. Creating the Scrapy requests works fine; for example, my page1.com is always redirected to https://. I need to get this information as a return value, or is there a better way to obtain it with Scrapy?
import scrapy

class myspider(scrapy.Spider):
    name = 'superspider'
    start_urls = ["https://page1.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        url = response.url
        # remove all possible prefixes from the url
        for remove in ['https://', 'http://', 'www.']:
            url = str(url).replace(remove, '').rstrip('/')
        # try again with every possible prefix
        for prefix in ['http://', 'http://www.', 'https://', 'https://www.']:
            yield scrapy.Request(url='{}{}'.format(prefix, url), callback=self.test, dont_filter=True)

    def test(self, response):
        print(response.url, response.status)
The output of this spider looks like this:
https://page1.com 200
https://page1.com/ 200
https://page1.com/ 200
https://page1.com/ 200
This is nice, but I would like to get this information as a return value, so that I know e.g. that http returned response code 200, and can then save it to a dictionary for later processing, or save it as JSON to a file (using items in Scrapy).
DESIRED OUTPUT:
I would like to have a dictionary named a with all the information:
print(a)
{'https://': True, 'http://': True, 'https://www.': True, 'http://www.': True}
Later I would like to scrape more information, so I will need to store everything under one object/json/...
Recommended answer
Instead of using the meta possibility pointed out by eLRuLL, you can parse request.url:
scrapy shell http://stackoverflow.com
In [1]: request.url
Out[1]: 'http://stackoverflow.com'
In [2]: response.url
Out[2]: 'https://stackoverflow.com/'
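Rather than slicing the string by hand, the prefix information in request.url can also be extracted with urllib.parse (a sketch, assuming you only need the scheme and whether a www prefix was used; not part of the original answer):

```python
from urllib.parse import urlsplit

def describe(url):
    """Split a URL into its scheme and whether the host uses a www prefix."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc.startswith('www.')

print(describe('http://stackoverflow.com'))   # ('http', False)
print(describe('https://www.example.com/'))   # ('https', True)
```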
To store the values from the different requests together in one dict/JSON, you can use an additional item pipeline like the duplicates filter shown at https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter. You would then have something like:
class WriteAllRequests(object):
    def __init__(self):
        self.urldic = {}

    def process_item(self, item, spider):
        # collect the status per prefix under the item's bare url
        self.urldic.setdefault(item['url'], {})[item['urlprefix']] = item['urlstatus']
        if len(self.urldic[item['url']]) == 4:
            # all four prefixes have answered; this could be passed on to a
            # standard pipeline with a higher order number
            writedata(self.urldic[item['url']])  # writedata left undefined here
            del self.urldic[item['url']]
        return item
You additionally have to activate this pipeline.
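Activation happens in the project's settings.py via the ITEM_PIPELINES setting (a sketch; the module path myproject.pipelines assumes a default project layout, and the order number 800 is arbitrary, chosen so the pipeline runs late):

```python
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.WriteAllRequests': 800,  # hypothetical module path
}
```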