Scrapy: crawl all websites in start_urls even if they redirect


Problem Description

I am trying to crawl a long list of websites. Some of the sites in the start_urls list redirect (301). I want Scrapy to crawl the redirected sites as if they were also on the allowed_domains list (which they are not). For example, example.com is on my start_urls list and in allowed_domains, and example.com redirects to foo.com. I want to crawl foo.com.

DEBUG: Redirecting (301) to <GET http://www.foo.com/> from <GET http://www.example.com>
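For context, a minimal setup that reproduces this situation might look like the sketch below (the spider name, the URLs, and the parse_it callback are placeholders, not my real spider):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RedirectDemoSpider(CrawlSpider):
    name = "redirect_demo"
    # example.com is allowed and 301-redirects to foo.com; the redirect is
    # followed, but links extracted from foo.com are then dropped by
    # OffsiteMiddleware because foo.com is not in allowed_domains.
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com"]
    rules = (Rule(LinkExtractor(), callback="parse_it", follow=True),)

    def parse_it(self, response):
        yield {"url": response.url}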

I tried dynamically adding to allowed_domains in the parse_start_url method and returning a Request object, so that Scrapy would go back and scrape the redirected website once its domain was on the allowed_domains list, but I still get:

 DEBUG: Filtered offsite request to 'www.foo.com'

Here is my attempt to dynamically add allowed_domains:

import tldextract
from scrapy import Request

def parse_start_url(self, response):
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
        return Request(response.url, callback=self.parse_callback)
    else:
        return self.parse_it(response, 1)

My other idea was to create a function in the spider middleware offsite.py that dynamically adds allowed_domains for redirected websites that originated from start_urls, but I have not been able to get that solution to work either.

Solution

I figured out the answer to my own question.

I edited the offsite middleware to pick up the updated list of allowed domains before it filters requests, and I dynamically add to the allowed_domains list in the parse_start_url method. (The stock OffsiteMiddleware compiles allowed_domains into a host regex only once, when the spider opens, so domains added later are ignored unless that regex is rebuilt.)

I added this function to OffsiteMiddleware:

def update_regex(self, spider):
    # Rebuild the host regex from the spider's current allowed_domains.
    self.host_regex = self.get_host_regex(spider)

I also edited this function inside OffsiteMiddleware:

def should_follow(self, request, spider):
    # Custom code to update the regex before filtering
    self.update_regex(spider)

    regex = self.host_regex
    # hostname can be None for wrong urls (like javascript links)
    host = urlparse_cached(request).hostname or ''
    return bool(regex.search(host))
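I patched Scrapy's installed offsite.py directly. If you prefer not to edit Scrapy's source, the same effect should be achievable by subclassing the middleware and swapping it in through settings.py; here is a rough sketch assuming a project package named myproject and current Scrapy import paths (the class name DynamicOffsiteMiddleware is just an example):

# myproject/middlewares.py (assumed location)
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

class DynamicOffsiteMiddleware(OffsiteMiddleware):
    def should_follow(self, request, spider):
        # Rebuild the host regex from the spider's current allowed_domains
        # on every check, so domains added at runtime are honoured.
        self.host_regex = self.get_host_regex(spider)
        return super().should_follow(request, spider)

# settings.py -- disable the built-in middleware, enable the subclass
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    "myproject.middlewares.DynamicOffsiteMiddleware": 500,
}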

Lastly, for my use case, I added this code to my spider:

def parse_start_url(self, response):
    # Register the (possibly redirected) domain before parsing, so that
    # links extracted from this response pass the offsite filter.
    domain = tldextract.extract(str(response.request.url)).registered_domain
    if domain not in self.allowed_domains:
        self.allowed_domains.append(domain)
    return self.parse_it(response, 1)
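Note that tldextract is a third-party package (pip install tldextract), not part of Scrapy; its registered_domain attribute collapses subdomains to the registered domain, for example:

>>> import tldextract
>>> tldextract.extract("http://www.foo.com/").registered_domain
'foo.com'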

This code adds the redirected domain for any start_url that gets redirected, so the spider then crawls those redirected sites.
