Scrapy & captcha


Question

I use Scrapy to submit a form on the site https://www.barefootstudent.com/jobs (on any link on that page, e.g. http://www.barefootstudent.com/los_angeles/jobs/full_time/full_time_nanny_needed_in_venice_217021).

My Scrapy bot logs in successfully, but I cannot get past the captcha. For the form submission I use scrapy.FormRequest.from_response:

frq = scrapy.FormRequest.from_response(response,
                                       formdata={'message': 'itttttttt',
                                                 'security': captcha, 'name': 'fx',
                                                 'category_id': '2', 'email': 'ololo%40gmail.com',
                                                 'item_id': '216640_2', 'location': '18',
                                                 'send_message': 'Send%20Message'},
                                       callback=self.afterForm)
yield frq

I want to load the captcha image from this page and type it in manually at script runtime, e.g.:

captcha = raw_input("put captcha in manually>")  

I tried:

 urllib.urlretrieve(captcha, "./captcha.jpg")

But this method loads the wrong captcha (the site rejects my input). I tried calling urllib.urlretrieve repeatedly in a single run of the script, and each time it returned a different captcha :(

After that I tried using ImagePipeline. But my problem is that the item is returned (and the image downloaded) only after the function has finished executing, even if I use yield.

item = BfsItem()
item['image_urls'] = [captcha]
yield item
captcha = raw_input("put captcha in manually>")
frq = scrapy.FormRequest.from_response(response,
                                       formdata={'message': 'itttttttt',
                                                 'security': captcha, 'name': 'fx',
                                                 'category_id': '2', 'email': 'ololo%40gmail.com',
                                                 'item_id': '216640_2', 'location': '18',
                                                 'send_message': 'Send%20Message'},
                                       callback=self.afterForm)
yield frq

At the moment my script requests the input, the picture has not been downloaded yet!

How can I modify my script so that FormRequest is only called after the captcha has been entered manually?

Thanks a lot!

Answer

The approach I am using, and that usually works quite well, looks like this (just a gist, you need to add your specific details):

Step 1 - getting the captcha URL (and keeping the form's response for later)

def parse_page_with_captcha(self, response):
    captcha_url = response.xpath(...)
    data_for_later = {'captcha_form': response}  # store the response for later use
    return scrapy.Request(captcha_url, callback=self.parse_captcha_download, meta=data_for_later)

Step 2 - now Scrapy will download the image, and we have to process it properly in a Scrapy callback

# needs: from PIL import Image; from StringIO import StringIO (Python 2)
def parse_captcha_download(self, response):
    captcha_target_filename = 'filename.png'
    # save the image for processing
    i = Image.open(StringIO(response.body))
    i.save(captcha_target_filename)

    # process the captcha (OCR, sending it to a decaptcha service, manual input, etc.)
    captcha_text = solve_captcha(captcha_target_filename)

    # and now we have all the data we need to build the form request
    captcha_form = response.meta['captcha_form']

    return scrapy.FormRequest.from_response(captcha_form,
                                            formdata={'message': 'itttttttt',
                                                      'security': captcha_text, 'name': 'fx',
                                                      'category_id': '2', 'email': 'ololo%40gmail.com',
                                                      'item_id': '216640_2', 'location': '18',
                                                      'send_message': 'Send%20Message'},
                                            callback=self.afterForm)

Important details

Captcha-protected forms need some way to link a captcha image with the particular user/client who saw and answered it. This is usually done with cookie-based sessions or with special parameters/image tokens hidden in the captcha form.

The scraper code must be careful not to destroy this link, otherwise it will answer some captcha, but not the one it actually has to answer.
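This session binding can be sketched with a toy model (purely illustrative: FakeCaptchaSite and its whole behavior are assumptions, not the real site). Any request that arrives without a known session cookie starts a fresh session with a fresh captcha, so an out-of-band fetch sees a different captcha than the one bound to the form's session:

```python
import itertools

class FakeCaptchaSite:
    """Toy model (an assumption, not the real site): every session
    cookie is bound to exactly one captcha, and any request arriving
    without a known cookie starts a brand-new session."""
    def __init__(self):
        self._ids = itertools.count()
        self.sessions = {}  # cookie -> captcha text bound to that session

    def get_captcha(self, cookie=None):
        if cookie not in self.sessions:
            n = next(self._ids)
            cookie = 'session-%d' % n
            self.sessions[cookie] = 'captcha-%d' % n
        return cookie, self.sessions[cookie]

    def submit_form(self, cookie, captcha_answer):
        # an answer is only accepted for the session it was issued to
        return self.sessions.get(cookie) == captcha_answer

site = FakeCaptchaSite()

# the session that rendered the form and will submit it
form_cookie, form_captcha = site.get_captcha()

# a urllib.urlretrieve-style fetch: no shared cookie jar -> new session
_, urllib_captcha = site.get_captcha()

print(site.submit_form(form_cookie, urllib_captcha))  # False: wrong captcha
print(site.submit_form(form_cookie, form_captcha))    # True: same session
```

This is exactly why the answer's step 1 fetches the captcha through a regular scrapy.Request: that request shares the spider's cookie jar, so the downloaded image stays bound to the session that will submit the form.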

Why doesn't it work with the two examples Verz1Lka posted?

The urllib.urlretrieve approach works completely outside of Scrapy. While this is generally a bad idea (it forgoes the benefits of Scrapy's scheduling, etc.), the major problem here is that this request runs entirely outside of the session cookies, URL parameters, etc. that the target site uses to track which captcha was sent to a particular browser.

The approach using the image pipeline, on the other hand, plays nicely within Scrapy's rules, but those image downloads are scheduled to happen at a later time, so the captcha download won't be available at the moment it is needed.
