Python Scrapy - mimetype based filter to avoid non-text file downloads
Question
I have a running Scrapy project, but it is bandwidth-intensive because it tries to download a lot of binary files (zip, tar, mp3, etc.).
I think the best solution is to filter requests based on the mimetype (Content-Type) HTTP header. I looked at the Scrapy code and found this setting:
DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
I changed it to:

DOWNLOADER_HTTPCLIENTFACTORY = 'myproject.webclients.ScrapyHTTPClientFactory'
And played a little with ScrapyHTTPPageGetter; here are the edits highlighted:
class ScrapyHTTPPageGetter(HTTPClient):
    # this is my edit
    def handleEndHeaders(self):
        if 'Content-Type' in self.headers.keys():
            mimetype = str(self.headers['Content-Type'])
            # Actually I need only the html, but just in
            # case I've preserved all the text
            if mimetype.find('text/') > -1:
                # Good, this page is needed
                self.factory.gotHeaders(self.headers)
            else:
                self.factory.noPage(Exception('Incorrect Content-Type'))
I feel this is wrong. I need a more Scrapy-friendly way to cancel/drop the request right after determining that it has an unwanted mimetype, instead of waiting for the whole body to be downloaded.
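For reference, the Content-Type test itself can be pulled out into a small helper. This is only a sketch (the function name and the parameter-stripping behavior are my own, not from the original post); it also handles headers that carry parameters such as text/html; charset=utf-8, which the substring check above would match only by accident:

```python
def is_text_content(content_type):
    """Return True if the Content-Type header names a text/* mimetype.

    Strips optional parameters such as '; charset=utf-8' before testing,
    and treats a missing/empty header as non-text.
    """
    if not content_type:
        return False
    mimetype = content_type.split(';', 1)[0].strip().lower()
    return mimetype.startswith('text/')


print(is_text_content('text/html; charset=utf-8'))  # True
print(is_text_content('application/zip'))           # False
```

The same predicate works unchanged on either side: in a download handler or in the proxy heuristic discussed in the answer below.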
Edit:

I'm asking specifically about this part: self.factory.noPage(Exception('Incorrect Content-Type')). Is that the correct way to cancel a request?
Update 1:

My current setup has crashed the Scrapy server, so please don't try to use the same code above to solve the problem.
Update 2:

I have set up an Apache-based website for testing, using the following structure:
/var/www/scrapper-test/Zend -> /var/www/scrapper-test/Zend.zip (symlink)
/var/www/scrapper-test/Zend.zip
I have noticed that Scrapy discards the one with the .zip extension, but scrapes the one without .zip even though it's just a symbolic link to it.
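That observation is consistent with extension-based filtering: the symlinked URL simply has no extension, so a filter keyed on the path sees nothing to reject, which is exactly why a Content-Type check is more reliable here. A quick stdlib illustration, using the two paths from the test setup above:

```python
import os.path

# The two URLs from the Apache test setup: one bare symlink, one .zip
for path in ['/scrapper-test/Zend', '/scrapper-test/Zend.zip']:
    root, ext = os.path.splitext(path)
    print(path, '->', repr(ext))
# /scrapper-test/Zend -> ''
# /scrapper-test/Zend.zip -> '.zip'
```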
Answer
The solution is to set up a Node.js proxy and configure Scrapy to use it through the http_proxy environment variable.
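Pointing Scrapy at the proxy is then just a matter of setting http_proxy before the crawl starts, since Scrapy's built-in HttpProxyMiddleware picks it up from the environment. A minimal sketch, assuming the proxy listens on localhost port 8080 as in the Node.js script in this answer:

```python
import os

# Assumed address: the Node.js proxy in this answer listens on port 8080.
os.environ['http_proxy'] = 'http://localhost:8080'

print(os.environ['http_proxy'])
```

Alternatively, export http_proxy in the shell before running scrapy crawl; the effect is the same.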
What the proxy should do:

- Take HTTP requests from Scrapy and send them to the server being crawled, then give the response back to Scrapy, i.e. intercept all HTTP traffic.
- For binary files (based on a heuristic you implement), send a 403 Forbidden error to Scrapy and immediately close the request/response. This saves time and traffic, and Scrapy won't crash.
That really works!
var http = require('http');

http.createServer(function(clientReq, clientRes) {
    var options = {
        host: clientReq.headers['host'],
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers
    };

    var proxyReq = http.request(options, function(proxyRes) {
        var contentType = proxyRes.headers['content-type'] || '';
        if (!contentType.startsWith('text/')) {
            // Binary response: refuse it and close the upstream
            // connection immediately, before the body downloads.
            proxyRes.destroy();
            var httpForbidden = 403;
            clientRes.writeHead(httpForbidden);
            clientRes.end('Binary download is disabled.');
            return; // don't fall through and write the headers twice
        }
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
    });

    proxyReq.on('error', function(e) {
        console.log('problem with proxy request: ' + e.message);
    });

    proxyReq.end();
}).listen(8080);