How do I set up Scrapy to deal with a captcha

Problem description

I'm trying to scrape a site that requires the user to enter the search value and a captcha. I've got an optical character recognition (OCR) routine for the captcha that succeeds about 33% of the time. Since the captchas are always alphabetic text, I want to reload the captcha if the OCR function returns non-alphabetic characters. Once I have a text "word", I want to submit the search form.
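
A minimal sketch of that validity check, assuming the OCR routine returns a string (run_ocr and captcha_image are hypothetical names, not part of the question):

word = run_ocr(captcha_image)  # hypothetical stand-in for the OCR routine (~33% accurate)
if word and word.isalpha():
    pass  # plausible alphabetic word: go ahead and submit the search form
else:
    pass  # non-alphabetic output: reload the captcha and run the OCR again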

The results come back in the same page, with the form ready for a new search and a new captcha. So I need to rinse and repeat until I've exhausted my search terms.

Here's the top-level algorithm:

  1. Load page initially
  2. Download the captcha image, run it through the OCR
  3. If the OCR doesn't come back with an alphabetic-only result, refresh the captcha and repeat this step
  4. Submit the query form in the page with search term and captcha
  5. Check the response to see whether the captcha was correct
  6. If it was correct, scrape the data
  7. Go to 2

I've tried using a pipeline for getting the captcha, but then I don't have the value for the form submission. If I just fetch the image without going through the framework, using urllib or something, then the cookie with the session is not submitted, so the captcha validation on the server fails.
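
By contrast, a request made through Scrapy itself passes through the framework's cookies middleware, so the session cookie rides along with the image fetch. A rough sketch of what that could look like inside a spider (the image selector and method names are assumptions):

def parse(self, response):
    # Hypothetical selector: locate the captcha <img> on the search page
    img_url = response.urljoin(response.css('img#captcha::attr(src)').get())
    # This request goes through Scrapy's cookie handling, so the session
    # cookie is sent automatically
    yield scrapy.Request(img_url, callback=self.handle_captcha_image)

def handle_captcha_image(self, response):
    word = run_ocr(response.body)  # hypothetical OCR call on the raw image bytes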

What's the ideal Scrapy way of doing this?

Recommended answer

It's a really deep topic with a bunch of solutions, but if you want to apply the logic you've defined in your post, you can use a Scrapy downloader middleware.

Something like this:

import logging

from scrapy.exceptions import IgnoreRequest


class CaptchaMiddleware(object):
    max_retries = 5

    def process_response(self, request, response, spider):
        if not request.meta.get('solve_captcha', False):
            return response  # only solve requests that are marked with the meta key
        # find_captcha() and solve_captcha() are assumed helpers, not defined in the answer
        captcha = find_captcha(response)
        if not captcha:  # the page might not have a captcha at all!
            return response
        solved = solve_captcha(captcha)
        if solved:
            # store the results on request.meta; the spider reads them via response.meta
            request.meta['captcha'] = captcha
            request.meta['solved_captcha'] = solved
            return response
        else:
            # retry the page for a new captcha, but prevent an endless loop
            if request.meta.get('captcha_retries', 0) >= self.max_retries:
                logging.warning('max retries for captcha reached for {}'.format(request.url))
                raise IgnoreRequest
            request.dont_filter = True
            request.meta['captcha_retries'] = request.meta.get('captcha_retries', 0) + 1
            return request
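
For the middleware to take effect it also has to be enabled in the project settings; a minimal sketch, assuming the class lives in myproject/middlewares.py:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CaptchaMiddleware': 543,
}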

This example will intercept every response and try to solve the captcha. If it fails, it will retry the page for a new captcha; if it succeeds, it will attach meta keys with the solved captcha values, which the spider can read through response.meta.

In your spider you would use it like this:

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        url = ''  # url that requires captcha
        yield Request(url, callback=self.parse_captchad, meta={'solve_captcha': True},
                      errback=self.parse_fail)

    def parse_captchad(self, response):
        # the middleware stored the solved value in the request meta,
        # readable here through response.meta
        solved = response.meta['solved_captcha']
        # do stuff

    def parse_fail(self, failure):
        # failed to retrieve a captcha in 5 tries :(
        # do stuff
        pass
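
The "do stuff" in parse_captchad is step 4 of the algorithm: submitting the search form with the solved word. A sketch using FormRequest.from_response (the form field names, self.search_term, and parse_results are assumptions about the target site):

def parse_captchad(self, response):
    solved = response.meta['solved_captcha']
    # Submit the search form; the results page carries a fresh captcha,
    # so mark the new request for the middleware to solve as well
    yield scrapy.FormRequest.from_response(
        response,
        formdata={'query': self.search_term, 'captcha': solved},
        callback=self.parse_results,
        meta={'solve_captcha': True},
        errback=self.parse_fail,
    )

parse_results (not shown) would scrape the data and then yield the next submission, which is the rinse-and-repeat loop from the question.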
