Python Scrapy - mimetype based filter to avoid non-text file downloads


Question

I have a running Scrapy project, but it is bandwidth-intensive because it tries to download a lot of binary files (zip, tar, mp3, etc.).

I think the best solution is to filter the requests based on the mimetype (Content-Type:) HTTP header. I looked at the Scrapy code and found this setting:

DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'

I changed it to:

DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.webclients.ScrapyHTTPClientFactory'

And played a little with ScrapyHTTPPageGetter; here are the edits, highlighted:

class ScrapyHTTPPageGetter(HTTPClient):
    # this is my edit
    def handleEndHeaders(self):
        if 'Content-Type' in self.headers.keys():
            mimetype = str(self.headers['Content-Type'])
            # Actually I need only the html, but just in 
            # case I've preserved all the text
            if mimetype.find('text/') > -1: 
                # Good, this page is needed
                self.factory.gotHeaders(self.headers)
            else:
                self.factory.noPage(Exception('Incorrect Content-Type'))

I feel this is wrong. I need a more Scrapy-friendly way to cancel/drop the request right after determining that it's an unwanted mimetype, instead of waiting for the whole body to be downloaded.
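For comparison, the more conventional place for this kind of check is a downloader middleware rather than the HTTP client factory. The filtering decision itself can be sketched without Scrapy; the middleware wiring described in the comments (a `process_response()` that raises `scrapy.exceptions.IgnoreRequest`) is an assumption of this sketch, not code from the question:

```python
# A minimal sketch of the mimetype decision, assuming the Content-Type
# header value is available as a string. In Scrapy this check could live
# in a downloader middleware's process_response(), raising
# scrapy.exceptions.IgnoreRequest for unwanted types -- that wiring is
# hypothetical here, so only the decision function is shown.

def is_wanted_mimetype(content_type, allowed_prefixes=('text/',)):
    """Return True if the Content-Type value starts with an allowed prefix."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mimetype = content_type.split(';', 1)[0].strip().lower()
    return mimetype.startswith(allowed_prefixes)

print(is_wanted_mimetype('text/html; charset=UTF-8'))  # True
print(is_wanted_mimetype('application/zip'))           # False
```

Note that this still runs after the headers arrive, so by itself it does not stop the body transfer; that limitation is what motivates the proxy-based answer below.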

Edit:
I'm asking specifically whether this part, self.factory.noPage(Exception('Incorrect Content-Type')), is the correct way to cancel a request.

Update 1:
My current setup has crashed the Scrapy server, so please don't try to use the same code above to solve the problem.

Update 2:
I have set up an Apache-based website for testing, using the following structure:

/var/www/scrapper-test/Zend -> /var/www/scrapper-test/Zend.zip (symlink)
/var/www/scrapper-test/Zend.zip

I have noticed that Scrapy discards the URL with the .zip extension, but scrapes the one without .zip, even though it's just a symbolic link to the same file.
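That observation is consistent with extension-based URL filtering: a filter keyed on the URL path's extension cannot catch a symlinked copy whose URL has no extension. A small sketch of that failure mode (the ignored-extension list below is a hypothetical subset for illustration, not Scrapy's actual list):

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical subset of ignored extensions, for illustration only.
IGNORED_EXTENSIONS = {'.zip', '.tar', '.mp3'}

def has_ignored_extension(url):
    """Return True if the URL path ends with an ignored file extension."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower() in IGNORED_EXTENSIONS

print(has_ignored_extension('http://example.com/scrapper-test/Zend.zip'))  # True: skipped
print(has_ignored_extension('http://example.com/scrapper-test/Zend'))      # False: the symlink slips through
```

This is why only a check on the response's Content-Type (or a proxy, as in the answer below) can reliably catch binary files served from extensionless URLs.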

Answer

The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.

What the proxy should do:

  • Take HTTP requests from Scrapy and send them to the server being crawled, then give the response back to Scrapy, i.e. intercept all HTTP traffic.
  • For binary files (based on a heuristic you implement), send a 403 Forbidden error to Scrapy and immediately close the request/response. This saves time and traffic, and Scrapy won't crash.

This really works!

var http = require('http');

http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Unwanted mimetype: abort the upstream response so the body
            // is never transferred, and answer Scrapy with 403 immediately.
            proxyRes.destroy();
            clientRes.writeHead(403);
            clientRes.end('Binary download is disabled.');
            return; // without this, the headers below would be written twice
        }

        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with proxy request: ' + e.message);
    });

    // Forward any request body and end the upstream request.
    clientReq.pipe(proxyReq);

}).listen(8080);
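To point Scrapy at this proxy, the http_proxy environment variable mentioned above can be set before the crawl starts; Scrapy's built-in HttpProxyMiddleware reads it from the environment. A minimal sketch (the port matches the listen(8080) above, and localhost assumes the proxy runs on the same machine):

```python
import os

# Assuming the Node.js proxy above is running locally on port 8080.
# Scrapy's HttpProxyMiddleware picks the proxy up from this variable.
os.environ['http_proxy'] = 'http://localhost:8080'

print(os.environ['http_proxy'])  # http://localhost:8080
```

Setting it in the shell (export http_proxy=http://localhost:8080) before running the spider works the same way.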
