NTLM authentication with Scrapy for web scraping


Problem description

I am attempting to scrape data from a website that requires authentication.
I have been able to log in successfully using requests and HttpNtlmAuth with the following:

import requests
from requests_ntlm import HttpNtlmAuth

s = requests.session()
url = "https://website.com/things"
response = s.get(url, auth=HttpNtlmAuth('DOMAIN\\USERNAME', 'PASSWORD'))

I would like to explore the capabilities of Scrapy; however, I have not been able to authenticate successfully.

I came across the following middleware, which looks like it could work, but I do not think I have implemented it properly:

https://github.com/reimund/ntlm-middleware/blob/master/ntlmauth.py

In my settings.py I have

SPIDER_MIDDLEWARES = { 'test.ntlmauth.NtlmAuthMiddleware': 400, }

and within my spider class:

http_user = 'DOMAIN\\USER'
http_pass = 'PASS'

I have not been able to get this to work.

If anyone who has successfully scraped a website behind NTLM authentication can point me in the right direction, I would appreciate it.

Answer

I was able to figure out what was going on.

1: This is considered a DOWNLOADER_MIDDLEWARE, not a SPIDER_MIDDLEWARE:

DOWNLOADER_MIDDLEWARES = { 'test.ntlmauth.NTLM_Middleware': 400, }

2: The middleware which I was trying to use needed to be modified significantly. Here is what works for me:

from scrapy.http import Response
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):

    def process_request(self, request, spider):
        # Fetch the page with requests + NTLM, then hand Scrapy a
        # ready-made Response. Returning a Response from process_request
        # short-circuits Scrapy's own downloader for this request.
        url = request.url
        usr = getattr(spider, 'http_user', '')
        pwd = getattr(spider, 'http_pass', '')
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        return Response(url, response.status_code, {}, response.content)

Within the spider, all you need to do is set these variables:

http_user = 'DOMAIN\\USER'
http_pass = 'PASS'
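Because the middleware reads the credentials off the spider with getattr, any spider that defines http_user and http_pass will work, and spiders that omit them fall back to empty strings. A minimal sketch of that lookup (plain Python, no Scrapy dependency; DummySpider is a hypothetical stand-in for a real spider class):

```python
class DummySpider:
    # Stand-in for a scrapy.Spider subclass that declares credentials.
    http_user = 'DOMAIN\\USER'
    http_pass = 'PASS'

def get_credentials(spider):
    # Mirrors the middleware's lookup: missing attributes fall back to ''.
    usr = getattr(spider, 'http_user', '')
    pwd = getattr(spider, 'http_pass', '')
    return usr, pwd

print(get_credentials(DummySpider()))  # → ('DOMAIN\\USER', 'PASS')
print(get_credentials(object()))       # → ('', '')
```

This is why no extra wiring is needed beyond the two class attributes: the middleware discovers them by name at request time.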
