Scrapy i/o block when downloading files

Problem description

I am using Scrapy to scrape a website and download some files. The file_url I get redirects to another URL (a 302 redirect), so I use a separate method, handle_redirect, to resolve the redirected URL. I customized the files pipeline like this:

import requests
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):

    def handle_redirect(self, file_url):
        # resolve the 302 manually with a blocking HEAD request
        response = requests.head(file_url)
        if response.status_code == 302:
            file_url = response.headers["Location"]
        return file_url

    def get_media_requests(self, item, info):
        redirect_url = self.handle_redirect(item["file_urls"][0])
        yield scrapy.Request(redirect_url)

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no images")
        item['file_urls'] = file_paths
        return item

With the code above I can download the files, but the process blocks while downloading, so the whole project becomes very slow.

I tried another solution in the spider: use requests to get the redirected URL first, then pass it to another function, and use the default files pipeline.

    # inside a spider callback:
    yield scrapy.Request(
        download_url[0],
        meta={"name": name},
        dont_filter=True,
        callback=self.handle_redirect)

    def handle_redirect(self, response):
        logging.warning("response %s" % response.meta)
        download_url = response.headers["Location"].decode("utf-8")

        return AppListItem(
            name=response.meta["name"],
            file_urls=[download_url],
            )

This still blocks the process.

From the docs here:

Using the Files Pipeline

When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains "locked" at that particular pipeline stage until the files have finished downloading (or fail for some reason).

Does this mean I can't scrape the next URL until the file has been downloaded? (I don't set download_delay in my settings.)

EDIT

I already added this at first:

handle_httpstatus_list = [302]

so I will not be redirected to the redirected URL. My first solution uses requests because I thought yield would work like this:

  1. I crawl a page, keep yielding callbacks, and then return the item.
  2. The item is passed to the pipeline, and if it hits some i/o, the pipeline yields back to the spider to crawl the next page, just like normal asynchronous i/o does.

Or do I have to wait until the files are downloaded before I can crawl the next page? Is this a downside of Scrapy? The second part I don't follow is how to calculate the crawl speed. For instance, 3 s for a complete page, with a default concurrency of 16. I guess @neverlastn used 16/2/3 to get 2.5 pages/s. Doesn't a concurrency of 16 mean I can handle 16 requests at the same time, so the speed should be 16 pages/s? Please correct me if I'm wrong.

Edit 2

Thanks for your answer, I understand how to calculate it now, but I still don't understand the second part. On 302, I first met this problem: Error 302 Downloading File in Scrapy. I have a URL like

http://example.com/first

which returns a 302 and redirects to

http://example.com/second

but Scrapy doesn't automatically redirect to the second one and cannot download the file, which is weird. The code here (Scrapy-redirect) and the docs here (RedirectMiddleware) point out that Scrapy should handle redirects by default. That is why I did some tricks trying to fix it. My third solution will try to use Celery like this:

class MyFilesPipeline(FilesPipeline):
    # `app` is the project's Celery application instance, defined elsewhere
    @app.task
    def handle_redirect(self, file_url):
        response = requests.head(file_url)
        if response.status_code == 302:
            file_url = response.headers["Location"]
        return file_url

    def get_media_requests(self, item, info):
        # run handle_redirect as a Celery task instead of calling it inline
        redirect_url = self.handle_redirect.delay(item["file_urls"][0])
        yield scrapy.Request(redirect_url)

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no images")
        item['file_urls'] = file_paths
        return item

Since I already have a lot of spiders, I don't want to override them with the second solution, so I handle this in the pipeline instead. Would this solution be better?

Solution

You use the requests API, which is synchronous/blocking. This means that you turn your concurrency (CONCURRENT_REQUESTS_PER_DOMAIN) from 8 (the default) down to effectively one, and that seems to dominate your delay. The trick you did on your second attempt is nice: it doesn't use requests, so it should be faster.

Now, of course, you still add extra delay... If your first (HTML) request takes 1 s and the second (image) request takes 2 s, overall you have 3 s for a complete page. With a default concurrency of 16, this would mean that you would crawl about 2.5 pages/s. When your redirect fails and you don't crawl the image, the process takes approx. 1 s, i.e. 8 pages/s, so you might see a 3x slowdown. One solution might be to triple the number of concurrent requests you allow to run in parallel by increasing CONCURRENT_REQUESTS_PER_DOMAIN and/or CONCURRENT_REQUESTS. If you are running this from a place with limited bandwidth and/or increased latency, another solution might be to run it from a cloud server closer to the region where the image servers are hosted (e.g. EC2 US East).
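For reference, here is a minimal settings.py sketch of what tripling the concurrency could look like; the exact values are illustrative, not numbers prescribed by this answer:

# settings.py: roughly 3x the Scrapy defaults
# (CONCURRENT_REQUESTS defaults to 16, CONCURRENT_REQUESTS_PER_DOMAIN to 8)
CONCURRENT_REQUESTS = 48
CONCURRENT_REQUESTS_PER_DOMAIN = 24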

EDIT

The performance is better understood through Little's law. First, both CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS typically apply in parallel. CONCURRENT_REQUESTS_PER_DOMAIN is 8 by default, and I would guess that you typically download from a single domain, so your actual concurrency limit is 8. The level of concurrency (i.e. 8) isn't per second; it's a fixed number, like saying "that oven can bake at most 8 croissants at a time". How quickly your croissants bake is the latency (here, the web response time), and the metric you're interested in is their ratio: 8 croissants baking in parallel / 3 seconds per croissant means I will be baking about 2.5 croissants/second.
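As a quick sanity check, here is the same ratio written out in Python; this is only a sketch using the illustrative 1 s + 2 s timings from above:

# throughput is roughly concurrency / latency (Little's law, back of the envelope)
concurrency = 8           # CONCURRENT_REQUESTS_PER_DOMAIN default
seconds_per_page = 3.0    # 1 s for the HTML page + 2 s for the redirected file
print(concurrency / seconds_per_page)   # 2.66..., in the ballpark of the ~2.5 pages/s above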

On 302, I'm not sure what exactly you're trying to do. I think you're just following them, it's just that you do it manually. I think Scrapy will do this for you when you extend the allowed codes. FilesPipeline might not pick up the value from handle_httpstatus_list, but the global setting HTTPERROR_ALLOWED_CODES should affect the FilesPipeline as well.
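If you go that route, the global setting is a one-liner in settings.py; this is a sketch, and whether it actually changes FilesPipeline behaviour in your Scrapy version is worth verifying:

# settings.py: allow 302 responses through globally instead of setting
# handle_httpstatus_list = [302] on every spider
HTTPERROR_ALLOWED_CODES = [302]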

Anyway, requests is a bad option because it blocks, which means definitely very bad performance. Yielding Scrapy Requests will "get them out of the way" (for now), but you will "meet them" again because they use the same resources: the scheduler and the downloader do the actual downloads. This means that they will very likely slow down your performance... and this is a good thing. I understand that you need to crawl fast here, but Scrapy wants you to be conscious of what you're doing: when you set a concurrency limit of, say, 8 or 16, you trust Scrapy not to "flood" your target sites at a higher rate than that. Scrapy takes the pessimistic assumption that media files served by the same server/domain are traffic to its web server (rather than some CDN) and applies the same limits in order to protect both the target site and you. Otherwise, imagine a page that happens to have 1000 images in it. If you got those 1000 downloads somehow "for free", with concurrency set to 8 you would be making 8000 requests to the server in parallel, which is not a good thing.

If you want to get some downloads "for free", i.e. ones that don't adhere to the concurrency limits, you can use treq, a requests-like package for the Twisted framework. Here's how to use it in a pipeline. I would feel more comfortable using it for hitting APIs or web servers I own, rather than third-party servers.
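As a rough illustration of that idea (not the linked example verbatim), a pipeline could resolve the 302 with treq so the extra request never blocks the reactor; the class name here is hypothetical:

import treq
from twisted.internet import defer


class TreqRedirectPipeline(object):
    """Resolve a 302 Location header without blocking Scrapy's reactor."""

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        # treq returns Deferreds, so the crawl keeps running while we wait
        response = yield treq.head(item["file_urls"][0], allow_redirects=False)
        location = response.headers.getRawHeaders("Location")
        if response.code == 302 and location:
            url = location[0]
            if isinstance(url, bytes):
                url = url.decode("utf-8")
            item["file_urls"] = [url]
        defer.returnValue(item)

Because process_item returns a Deferred, Scrapy keeps handling other responses while the HEAD request is in flight, which is exactly the non-blocking behaviour the requests-based version lacks.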
