Using BeautifulSoup where authentication is required

Problem Description

I am scraping LAN data using BeautifulSoup4 and Python requests for a company project. Since the site has a login interface, I am not authorized to access the data. The login interface is a pop-up that doesn't allow me to access the page source or inspect the page elements without logging in. The error I get is this:

Access Error: Unauthorized Access to this document requires a User ID

This is a screenshot of the pop-up box (the blackened part is sensitive information). It has no information about the HTML tags at all, hence I cannot auto-login via Python.
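A login prompt with no HTML behind it is usually the browser's native dialog for HTTP-level authentication (Basic, Digest, or NTLM). Since requests_ntlm did not help, requests' built-in Basic and Digest handlers are worth trying. A minimal sketch, assuming the intranet URL and credentials below are placeholders, not values from the original post:

import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

url = "http://intranet.example.com/report"  # placeholder URL

# Try HTTP Basic first; fall back to Digest if the server rejects it
r = requests.get(url, auth=HTTPBasicAuth('user_name', 'passwd'))
if r.status_code == 401:
    r = requests.get(url, auth=HTTPDigestAuth('user_name', 'passwd'))
print(r.status_code)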

I have tried requests_ntlm, selenium, python requests, and even ParseHub, but none of them worked. I have been stuck at this stage for a month now! Any help would be appreciated.
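One selenium variant for such native dialogs is embedding the credentials directly in the URL, since the dialog itself cannot be automated. This is only a sketch under the assumption that the server uses Basic auth (host and credentials are placeholders), and some browsers block URLs with embedded credentials:

from selenium import webdriver

driver = webdriver.Chrome()
# Credentials embedded in the URL can bypass the native auth dialog
driver.get("http://user_name:passwd@intranet.example.com/report")
html = driver.page_source
driver.quit()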

Below is my initial code:

import requests
from requests_ntlm import HttpNtlmAuth
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Fetch a page with a spoofed User-Agent header
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Request the page with NTLM credentials (requests needs the http:// scheme)
r = requests.get("http://www.amazon.in", auth=HttpNtlmAuth('user_name', 'passwd'))
print(r.content)

s_data = BeautifulSoup(r.content, "lxml")
print(s_data.contents)

Error: Document Error: Unauthorized

Access Error: Unauthorized

Access to this document requires a User ID

This is the error I get when BeautifulSoup tries to access the data after I have manually logged into the site.
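A manual browser login does not carry over to requests, since each tool keeps its own session. One workaround worth sketching is copying the session cookie out of the browser's developer tools after logging in and attaching it to a requests session. The cookie name, value, and URL below are placeholders:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# Paste the cookie copied from the browser's developer tools
session.cookies.set("SESSIONID", "value-from-devtools")

r = session.get("http://intranet.example.com/report",
                headers={"User-Agent": "Mozilla/5.0"})
print(BeautifulSoup(r.content, "lxml").prettify())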

Recommended Answer

Have you considered using mechanize?

import mechanize
import http.cookiejar  # 'cookielib' in Python 2
from bs4 import BeautifulSoup

# Attach a cookie jar so the login session persists across requests
cook = http.cookiejar.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cook)

br.open("http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1")

# Fill in and submit the first form on the page
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()

print(br.response().read())
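The authenticated response can then be handed to BeautifulSoup just as in the question's code (a small follow-up sketch, not part of the original answer):

s_data = BeautifulSoup(br.response().read(), "lxml")
print(s_data.prettify())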

Edit

If you come up against robots.txt issues and you have permission to circumvent them, take a look at this answer for techniques to do so: https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden
