Python Scrapy - mimetype based filter to avoid non-text file downloads
Question
I have a running Scrapy project, but it is bandwidth-intensive because it tries to download a lot of binary files (zip, tar, mp3, etc.).
I think the best solution is to filter requests based on the mimetype (Content-Type) HTTP header. I looked at the Scrapy code and found this setting:
DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
I changed it to:

DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.webclients.ScrapyHTTPClientFactory'
And played a little with ScrapyHTTPPageGetter; here are the edits highlighted:
class ScrapyHTTPPageGetter(HTTPClient):
    # this is my edit
    def handleEndHeaders(self):
        if 'Content-Type' in self.headers.keys():
            mimetype = str(self.headers['Content-Type'])
            # Actually I need only the html, but just in
            # case I've preserved all the text
            if mimetype.find('text/') > -1:
                # Good, this page is needed
                self.factory.gotHeaders(self.headers)
            else:
                self.factory.noPage(Exception('Incorrect Content-Type'))
I feel this is wrong. I need a more Scrapy-friendly way to cancel/drop the request right after determining that it has an unwanted mimetype, instead of waiting for the whole body to be downloaded.
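For reference, the Content-Type test itself can be pulled out into a small helper. This is only a sketch (the function name and the parameter-stripping behavior are my own, not from the original post); it also handles headers that carry parameters such as text/html; charset=utf-8, which the substring check above would match only by accident:

```python
def is_text_content(content_type):
    """Return True if the Content-Type header names a text/* mimetype.

    Strips optional parameters such as '; charset=utf-8' before testing,
    and treats a missing/empty header as non-text.
    """
    if not content_type:
        return False
    mimetype = content_type.split(';', 1)[0].strip().lower()
    return mimetype.startswith('text/')


print(is_text_content('text/html; charset=utf-8'))  # True
print(is_text_content('application/zip'))           # False
```

The same predicate works unchanged on either side: in a download handler or in the proxy heuristic discussed in the answer below.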
Edit:

I'm asking specifically about this part: self.factory.noPage(Exception('Incorrect Content-Type')). Is that the correct way to cancel a request?
Update 1:

My current setup has crashed the Scrapy server, so please don't try to use the same code above to solve the problem.
Update 2:

I have set up an Apache-based website for testing, using the following structure:
/var/www/scrapper-test/Zend -> /var/www/scrapper-test/Zend.zip (symlink)
/var/www/scrapper-test/Zend.zip
I have noticed that Scrapy discards the one with the .zip extension, but scrapes the one without .zip even though it's just a symbolic link to it.
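That observation is consistent with extension-based filtering: the symlinked URL simply has no extension, so a filter keyed on the path sees nothing to reject, which is exactly why a Content-Type check is more reliable here. A quick stdlib illustration, using the two paths from the test setup above:

```python
import os.path

# The two URLs from the Apache test setup: one bare symlink, one .zip
for path in ['/scrapper-test/Zend', '/scrapper-test/Zend.zip']:
    root, ext = os.path.splitext(path)
    print(path, '->', repr(ext))
# /scrapper-test/Zend -> ''
# /scrapper-test/Zend.zip -> '.zip'
```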
Answer
The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.
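Pointing Scrapy at the proxy is then just a matter of setting http_proxy before the crawl starts, since Scrapy's built-in HttpProxyMiddleware picks it up from the environment. A minimal sketch, assuming the proxy listens on localhost port 8080 as in the Node.js script in this answer:

```python
import os

# Assumed address: the Node.js proxy in this answer listens on port 8080.
os.environ['http_proxy'] = 'http://localhost:8080'

print(os.environ['http_proxy'])
```

Alternatively, export http_proxy in the shell before running scrapy crawl; the effect is the same.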
What the proxy should do:

- Take HTTP requests from Scrapy and send them to the server being crawled, then give the response back to Scrapy, i.e. intercept all HTTP traffic.
- For binary files (based on a heuristic you implement), send a 403 Forbidden error to Scrapy and immediately close the request/response. This saves time and traffic, and Scrapy won't crash.
That really works!
var http = require('http');

http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Binary response: refuse it and close the upstream
            // connection immediately, before the body downloads.
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.end('Binary download is disabled.');
            return; // don't fall through and write the headers twice
        }
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with proxy request: ' + e.message);
    });

    proxyReq.end();
}).listen(8080);