How do I set up Scrapy to deal with a captcha


Problem description


I'm trying to scrape a site that requires the user to enter the search value and a captcha. I've got an optical character recognition (OCR) routine for the captcha that succeeds about 33% of the time. Since the captchas are always alphabetic text, I want to reload the captcha if the OCR function returns non-alphabetic characters. Once I have a text "word", I want to submit the search form.


The results come back in the same page, with the form ready for a new search and a new captcha. So I need to rinse and repeat until I've exhausted my search terms.

Here's the top-level algorithm:

  1. Load the page initially
  2. Download the captcha image and run it through the OCR
  3. If the OCR doesn't come back with a text-only result, refresh the captcha and repeat this step
  4. Submit the query form in the page with the search term and the captcha
  5. Check the response to see whether the captcha was correct
  6. If it was correct, scrape the data
  7. Go to 2
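Step 3 of this loop, deciding whether the OCR result is usable, reduces to a non-empty, purely alphabetic check. A minimal sketch (the helper name is hypothetical):

```python
def is_usable_captcha_guess(text):
    """Accept an OCR result only if it is non-empty and purely
    alphabetic; anything else means the captcha should be reloaded."""
    return bool(text) and text.isalpha()
```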


I've tried using a pipeline for getting the captcha, but then I don't have the value for the form submission. If I just fetch the image without going through the framework, using urllib or something, then the cookie with the session is not submitted, so the captcha validation on the server fails.


What's the ideal Scrapy way of doing this?

Recommended answer


It's a really deep topic with a bunch of solutions. But if you want to apply the logic you've defined in your post you can use scrapy Downloader Middlewares.

Something like:

import logging

from scrapy.exceptions import IgnoreRequest


class CaptchaMiddleware(object):
    max_retries = 5

    def process_response(self, request, response, spider):
        if not request.meta.get('solve_captcha', False):
            return response  # only solve requests that are marked with the meta key
        captcha = find_captcha(response)
        if not captcha:  # it might not have a captcha at all!
            return response
        solved = solve_captcha(captcha)
        if solved:
            response.meta['captcha'] = captcha
            response.meta['solved_captcha'] = solved
            return response
        # retry the page for a new captcha, but prevent an endless loop
        if request.meta.get('captcha_retries', 0) >= self.max_retries:
            logging.warning('max retries for captcha reached for %s', request.url)
            raise IgnoreRequest
        request.meta['captcha_retries'] = request.meta.get('captcha_retries', 0) + 1
        return request.replace(dont_filter=True)
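For the middleware to run at all, it has to be registered in the project settings. A minimal sketch (the module path `myproject.middlewares` and the priority 550 are assumptions for illustration):

```python
# settings.py (hypothetical module path and priority)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CaptchaMiddleware': 550,
}
```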
    


This example will intercept every response and try to solve the captcha. If it fails, it will retry the page for a new captcha; if it succeeds, it will add some meta keys to the response with the solved captcha values.
In your spider you would use it like this:

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    def parse(self, response):
        url = ''  # url that requires captcha
        yield Request(url, callback=self.parse_captchad,
                      meta={'solve_captcha': True},
                      errback=self.parse_fail)

    def parse_captchad(self, response):
        solved = response.meta['solved_captcha']
        # do stuff

    def parse_fail(self, failure):
        # failed to retrieve captcha in 5 tries :(
        # do stuff
        pass
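The captcha-finding and captcha-solving helpers used in the middleware are left to you. A minimal sketch of the solving side, assuming some OCR callable (e.g. one wrapping pytesseract) is passed in — the injection is only to keep the sketch self-contained:

```python
def solve_captcha(captcha_image_bytes, ocr):
    """Run the OCR callable on the captcha image and return the text
    only if it is purely alphabetic; otherwise return None so the
    middleware retries the page for a fresh captcha."""
    text = (ocr(captcha_image_bytes) or '').strip()
    return text if text.isalpha() else None
```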

