How to skip some file types while crawling with Scrapy?


Question

I want to skip links to some file types (.exe, .zip, .pdf) while crawling with Scrapy, but I don't want to use a Rule with a specific URL regex. How?

Update:

Since it's hard to decide whether to follow a link just from the response's Content-Type when the body hasn't been downloaded yet, I switched to dropping the URL in a downloader middleware instead. Thanks Peter and Leo.

Answer

If you go to linkextractor.py within the Scrapy root directory, you will see the following:

"""
Common code and definitions used by Link extractors (located in
scrapy.contrib.linkextractor).
"""

# common file extensions that are not followed if they occur in links
IGNORED_EXTENSIONS = [
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',

    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',

    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a',

    # other
    'css', 'pdf', 'doc', 'exe', 'bin', 'rss', 'zip', 'rar',
]

However, since this applies to link extractors (and you don't want to use Rules), I am not sure this will solve your problem. (I only just realized you specified that you didn't want to use Rules; I had thought you were asking how to change the file-extension restrictions without specifying them directly in a Rule.)
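For completeness, if you were willing to use a link extractor after all, the default ignore list can be extended without touching Scrapy's source via the deny_extensions argument. A minimal sketch, assuming a recent Scrapy version (the extra extensions here are just illustrative examples):

from scrapy.linkextractors import IGNORED_EXTENSIONS, LinkExtractor

# Start from Scrapy's default ignore list and append anything else to skip.
# Entries are given without the leading dot.
extractor = LinkExtractor(deny_extensions=IGNORED_EXTENSIONS + ['torrent', 'iso'])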

The good news is that you can also build your own downloader middleware and drop any or all requests to URLs with an undesirable extension. See Downloader Middleware.

You can get the requested URL by accessing the request object's url attribute, as follows: request.url

Basically, check the end of the URL string for '.exe' or whatever extension you want to drop, and if it ends with one of those extensions, raise an IgnoreRequest exception; the request will be dropped immediately.

Update

In order to process the request before it is downloaded, you need to make sure you define a process_request method within your custom downloader middleware.

According to the Scrapy documentation:

process_request

This method is called for each request that goes through the downloader middleware.

process_request() should return either None, a Response object, or a Request object.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request performed (and its response downloaded).

If it returns a Response object, Scrapy won't bother calling any other request or exception middleware, or the appropriate download function; it'll return that Response. Response middleware is always called on every Response.

If it returns a Request object, the returned request will be rescheduled (in the Scheduler) to be downloaded in the future. The callback of the original request will always be called. If the new request has a callback, it will be called with the downloaded response, and the output of that callback will then be passed to the original callback. If the new request doesn't have a callback, the downloaded response will simply be passed to the original request's callback.

If it raises an IgnoreRequest exception, the entire request will be dropped completely and its callback never called.

So essentially, just create a downloader middleware class with a process_request method, which takes a request object and a spider object as parameters, and raise an IgnoreRequest exception if the URL contains an unwanted extension. See the sketch below.
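A minimal sketch of such a middleware, using the three extensions from the question; the class name, settings module path, and priority value (543) below are illustrative, not mandated:

from scrapy.exceptions import IgnoreRequest

UNWANTED_EXTENSIONS = ('.exe', '.zip', '.pdf')  # extensions to drop

class FilterExtensionsMiddleware:
    """Drop requests whose URL ends with an unwanted file extension."""

    def process_request(self, request, spider):
        # Runs before the download; raising IgnoreRequest drops the
        # request without ever hitting the network.
        if request.url.lower().endswith(UNWANTED_EXTENSIONS):
            raise IgnoreRequest("unwanted extension: %s" % request.url)
        return None  # let Scrapy continue processing this request

To enable it, register the class in your project's settings.py (the module path here is hypothetical):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.FilterExtensionsMiddleware': 543,
}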

This all occurs before the page is downloaded. However, if you want to inspect the response headers instead, then a request will have to be made to the webpage first.

You could always implement both a process_request and a process_response method in the middleware, the idea being that obvious extensions are dropped immediately, and then, if for some reason the URL doesn't carry a file extension, the request is processed anyway and caught in the process_response method (since you can verify via the headers there). A sketch of this two-hook approach follows.

