Bypassing intrusive cookie statement with requests library

Problem Description

I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.

I am trying to access the website as follows:

from bs4 import BeautifulSoup as soup
import requests

# Page to crawl
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
# Fetch the page with a browser-like User-Agent and parse the HTML
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")

This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
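As a quick sanity check, one can test whether the response is the consent interstitial rather than the real match centre page. The following is only a minimal sketch: the "cookie" substring test is a heuristic assumption about the interstitial's markup, not a confirmed detail of vi.nl.

# Heuristic check: does the fetched page look like the consent interstitial?
# (The marker string is an assumption, not confirmed by the site.)
page_title = htmlsoup.title.string if htmlsoup.title else ""
if "cookie" in html.text.lower():
    print("Probably got the cookie consent page:", page_title)
else:
    print("Probably got the requested page:", page_title)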

I considered using mechanize.Browser, but that seems a pretty roundabout way of doing it.

Recommended Answer

Try setting:

cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)

This will bypass the cookie consent page and land you straight on the requested page.
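Putting the question's snippet and the suggested cookie together, a complete request might look like the sketch below. The BCPermissionLevel value comes from the answer above; everything else mirrors the original code.

from bs4 import BeautifulSoup as soup
import requests

website = r"http://www.vi.nl/matchcenter/vandaag.shtml"

# Pre-set the cookie that the consent page would otherwise set via JavaScript
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website,
                    headers={"User-Agent": "Mozilla/5.0"},
                    cookies=cookies)
htmlsoup = soup(html.text, "html.parser")
print(htmlsoup.title)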

Note: You can find the above by analyzing the JavaScript code that runs on the cookie consent page. It is a bit obfuscated, but it should not be difficult. If you run into the same type of problem again, take a look at what kind of cookies the JavaScript code executed by the button's event handler sets.
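As a rough, hedged sketch of how one might start that analysis from Python, the snippet below pulls the consent page and prints any inline script that mentions document.cookie. The substring search is just a heuristic for locating candidate scripts, not something specific to vi.nl.

import requests
from bs4 import BeautifulSoup as soup

page = requests.get("http://www.vi.nl/matchcenter/vandaag.shtml",
                    headers={"User-Agent": "Mozilla/5.0"})
doc = soup(page.text, "html.parser")

# Print inline scripts that touch document.cookie; these are the candidates
# for the code that sets the consent cookie when the accept button is clicked.
for script in doc.find_all("script"):
    if script.string and "document.cookie" in script.string:
        print(script.string[:500])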
