Rewrite scrapy URLs before sending the request


Question

I'm using scrapy to crawl a multilingual site. For each object, versions in three different languages exist. I'm using the search as a starting point. Unfortunately the search contains URLs in various languages, which causes problems when parsing.

Therefore I'd like to preprocess the URLs before they get sent out. If they contain a specific string, I want to replace that part of the URL.

My spider extends the CrawlSpider. I looked at the docs and found the make_requests_from_url(url) method, which led to this attempt:

def make_requests_from_url(self, url):
    """
    Override the original function to make sure only German URLs are
    being used. If French or Italian URLs are detected, they're
    rewritten.
    """
    if '/f/suche' in url:
        self.log('French URL was rewritten: %s' % url)
        url = url.replace('/f/suche/pages/', '/d/suche/seiten/')
    elif '/i/suche' in url:
        self.log('Italian URL was rewritten: %s' % url)
        url = url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return super(MyMultilingualSpider, self).make_requests_from_url(url)

But that does not work for some reason. What would be the best way to rewrite URLs before requesting them? Maybe via a rule callback?

Answer

As you already extend CrawlSpider, you can use process_links() to process the URLs extracted by your link extractors (or process_request() if you prefer working at the Request level), as detailed in the CrawlSpider Rule documentation. This also explains why overriding make_requests_from_url() had no effect: CrawlSpider only calls it for the start URLs, while links extracted by the rules are turned into Requests directly.
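A minimal sketch of such a process_links callback, using the path segments from the question. The Link class below is only a stand-in for scrapy.link.Link so the logic is self-contained; in a real spider you would attach the function via Rule(LinkExtractor(...), process_links='rewrite_links', ...) and the links would come from the extractor.

```python
class Link:
    """Minimal stand-in for scrapy.link.Link, used here for illustration."""
    def __init__(self, url):
        self.url = url

# Mapping from French/Italian search paths to the German one,
# taken from the rewrites in the question.
REWRITES = {
    '/f/suche/pages/': '/d/suche/seiten/',
    '/i/suche/pagine/': '/d/suche/seiten/',
}

def rewrite_links(links):
    """process_links callback: rewrite non-German search URLs in place
    and return the (possibly modified) list of links."""
    for link in links:
        for foreign, german in REWRITES.items():
            if foreign in link.url:
                link.url = link.url.replace(foreign, german)
    return links
```

Because process_links receives the extracted links before any Request is built, every rewritten URL is what actually gets scheduled, which is exactly the preprocessing the question asks for.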

