Dynamically add to allowed_domains in a Scrapy spider


Problem Description

I have a spider that starts with a small list of allowed_domains at the beginning of the crawl. I need to add more domains to this whitelist dynamically, from within a parse callback, as the crawl continues, but the following piece of code does not accomplish that, since subsequent requests are still being filtered. Is there another way of updating allowed_domains from within the parser?

import urlparse

from BeautifulSoup import BeautifulSoup
from scrapy.http import Request
from scrapy.spider import BaseSpider


class APSpider(BaseSpider):
    name = "APSpider"

    allowed_domains = ["www.somedomain.com"]

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    ...

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        for link_tag in soup.findAll('td', {'class': 'half-width'}):
            _website = link_tag.find('a')['href']
            u = urlparse.urlparse(_website)
            # Whitelist the new domain, then follow the link.
            self.allowed_domains.append(u.netloc)

            yield Request(url=_website, callback=self.parse_secondary_site)

    ...

Recommended Answer

You could try something like the following. Since allowed_domains is empty when the spider starts, the offsite middleware does not filter any requests; the spider then enforces its own whitelist inside the callbacks:

class APSpider(BaseSpider):
    name = "APSpider"

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    def __init__(self, *args, **kwargs):
        super(APSpider, self).__init__(*args, **kwargs)
        # Start with an empty whitelist; it is filled on the first parse.
        self.allowed_domains = []

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        if not self.allowed_domains:
            for link_tag in soup.findAll('td', {'class': 'half-width'}):
                _website = link_tag.find('a')['href']
                u = urlparse.urlparse(_website)
                self.allowed_domains.append(u.netloc)

                yield Request(url=_website, callback=self.parse_secondary_site)

        # Manual offsite check: compare the host, not the full URL.
        if urlparse.urlparse(response.url).netloc in self.allowed_domains:
            yield Request(...)

...
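
Note that in stock Scrapy, appending to allowed_domains at runtime is not picked up by the built-in filter: OffsiteMiddleware compiles the list into a host regex once, when the spider_opened signal fires, so later additions are invisible to it. As a rough sketch of an alternative, you could subclass the middleware and recompile the regex before each check. This assumes the Scrapy 1.x module path (scrapy.spidermiddlewares.offsite; older releases used scrapy.contrib.spidermiddleware.offsite), and myproject.middlewares / RefreshingOffsiteMiddleware are hypothetical names:

# myproject/middlewares.py (hypothetical module)
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

class RefreshingOffsiteMiddleware(OffsiteMiddleware):
    """Recompile the allowed-domains regex on every check, so domains
    appended to spider.allowed_domains at runtime take effect."""

    def should_follow(self, request, spider):
        # get_host_regex() re-reads spider.allowed_domains each time.
        self.host_regex = self.get_host_regex(spider)
        return super(RefreshingOffsiteMiddleware, self).should_follow(request, spider)

# settings.py: replace the stock middleware at its default priority (500)
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    'myproject.middlewares.RefreshingOffsiteMiddleware': 500,
}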
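
If only a few follow-up requests need to escape the filter, a simpler option is to mark them with dont_filter=True, which the offsite middleware skips (note that this also bypasses the duplicate filter for those requests):

yield Request(url=_website, callback=self.parse_secondary_site,
              dont_filter=True)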

