Bypassing intrusive cookie statement with requests library
Question
I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests

# Fetch the page with a browser-like User-Agent and parse the HTML.
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser, but that seems a pretty roundabout way of doing it.
Answer
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the requested page.
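If you want to confirm, without making a network request, that requests will actually attach this cookie, you can build a PreparedRequest and inspect its Cookie header. A minimal sketch, reusing the BCPermissionLevel cookie and URL from above:

```python
import requests

website = "http://www.vi.nl/matchcenter/vandaag.shtml"
cookies = dict(BCPermissionLevel='PERSONAL')

# Build the request without sending it, so we can inspect the headers
# that requests would put on the wire.
req = requests.Request(
    "GET",
    website,
    headers={"User-Agent": "Mozilla/5.0"},
    cookies=cookies,
).prepare()

print(req.headers["Cookie"])  # BCPermissionLevel=PERSONAL
```

This is the same code path `requests.get(..., cookies=cookies)` goes through, so the header you see here is exactly what the server receives.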
Note: You can find the above by analyzing the JavaScript code that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult. If you run into the same type of problem again, take a look at what kind of cookies the JavaScript executed by the button's event handler sets.
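If you are crawling multiple pages, you can pre-set the consent cookie once on a requests.Session so every request carries it automatically. A sketch, assuming the same BCPermissionLevel cookie and the domain from the question:

```python
import requests

session = requests.Session()

# Pre-set the consent cookie on the session's cookie jar, mimicking what
# the consent page's JavaScript does in the browser; every subsequent
# session.get() to this domain will send it.
session.cookies.set("BCPermissionLevel", "PERSONAL", domain="www.vi.nl")

print(session.cookies.get("BCPermissionLevel"))  # PERSONAL
```

Using a Session also means any additional cookies the server sets along the way are kept and replayed, just as a browser would.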