Rewrite scrapy URLs before sending the request


Question

I'm using scrapy to crawl a multilingual site. For each object, versions in three different languages exist. I'm using the search as a starting point. Unfortunately the search contains URLs in various languages, which causes problems when parsing.

Therefore I'd like to preprocess the URLs before they get sent out. If they contain a specific string, I want to replace that part of the URL.

My spider extends the CrawlSpider. I looked at the docs and found the make_requests_from_url(url) method, which led to this attempt:

def make_requests_from_url(self, url):
    """
    Override the original function to make sure only German URLs are
    being used. If French or Italian URLs are detected, they're
    rewritten.
    """
    if '/f/suche' in url:
        self.log('French URL was rewritten: %s' % url)
        url = url.replace('/f/suche/pages/', '/d/suche/seiten/')
    elif '/i/suche' in url:
        self.log('Italian URL was rewritten: %s' % url)
        url = url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return super(MyMultilingualSpider, self).make_requests_from_url(url)

But that does not work for some reason. What would be the best way to rewrite URLs before requesting them? Maybe via a rule callback?

Answer

As you already extend CrawlSpider, you can use process_links() to process the URLs extracted by your link extractors (or process_request() if you prefer working at the Request level), as detailed in the CrawlSpider Rule documentation. This also explains why overriding make_requests_from_url() had no effect: CrawlSpider only calls it for the start URLs, while links extracted by the rules are turned into Requests directly.
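A minimal sketch of such a process_links callback, using the path segments from the question. The Link class below is only a stand-in for scrapy.link.Link so the logic is self-contained; in a real spider you would attach the function via Rule(LinkExtractor(...), process_links='rewrite_links', ...) and the links would come from the extractor.

```python
class Link:
    """Minimal stand-in for scrapy.link.Link, used here for illustration."""
    def __init__(self, url):
        self.url = url

# Mapping from French/Italian search paths to the German one,
# taken from the rewrites in the question.
REWRITES = {
    '/f/suche/pages/': '/d/suche/seiten/',
    '/i/suche/pagine/': '/d/suche/seiten/',
}

def rewrite_links(links):
    """process_links callback: rewrite non-German search URLs in place
    and return the (possibly modified) list of links."""
    for link in links:
        for foreign, german in REWRITES.items():
            if foreign in link.url:
                link.url = link.url.replace(foreign, german)
    return links
```

Because process_links receives the extracted links before any Request is built, every rewritten URL is what actually gets scheduled, which is exactly the preprocessing the question asks for.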

