NTLM authentication with Scrapy for web scraping
Question
I am trying to scrape data from a website that requires authentication.
I have been able to log in successfully using requests and HttpNtlmAuth with the following:
import requests
from requests_ntlm import HttpNtlmAuth

s = requests.session()
url = "https://website.com/things"
response = s.get(url, auth=HttpNtlmAuth('DOMAIN\\USERNAME', 'PASSWORD'))
I would like to explore the capabilities of Scrapy, but I have not been able to authenticate successfully.
I came across the following middleware, which looks like it could work, but I do not think I have been implementing it properly:
https://github.com/reimund/ntlm-middleware/blob/master/ntlmauth.py
In my settings.py I have:
SPIDER_MIDDLEWARES = { 'test.ntlmauth.NtlmAuthMiddleware': 400, }
In my spider class:
http_user = 'DOMAIN\\USER'
http_pass = 'PASS'
I have not been able to get this to work.
If anyone who has successfully scraped a website behind NTLM authentication can point me in the right direction, I would appreciate it.
Recommended answer
I was able to figure out what was going on.
1: This is a downloader middleware, not a spider middleware, so it has to be registered under DOWNLOADER_MIDDLEWARES rather than SPIDER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = { 'test.ntlmauth.NTLM_Middleware': 400, }
2: The middleware I was trying to use needed to be modified significantly. Here is what works for me:
from scrapy.http import Response
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):
    def process_request(self, request, spider):
        url = request.url
        # Credentials are read from attributes set on the spider.
        pwd = getattr(spider, 'http_pass', '')
        usr = getattr(spider, 'http_user', '')
        s = requests.session()
        # Fetch the page with requests, which performs the NTLM handshake.
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        # Returning a Response here makes Scrapy skip its own download.
        return Response(url, response.status_code, {}, response.content)
Within the spider, all you need to do is set these variables:
http_user = 'DOMAIN\\USER'
http_pass = 'PASS'