How to bypass a 'cookiewall' when using scrapy?
Question
I'm a new user to Scrapy. After following the tutorials for extracting data from websites, I am trying to accomplish something similar on forums.
What I want is to extract all posts on a forum page (to start with). However, this particular forum has a 'cookie wall'. So when I want to extract from http://forum.fok.nl/topic/2413069, each session I first need to click the "Yes, I accept cookies"-button.
My very basic scraper currently looks like this:
import scrapy

class FokSpider(scrapy.Spider):
    name = 'fok'
    allowed_domains = ['forum.fok.nl']
    start_urls = ['http://forum.fok.nl/']

    def parse(self, response):
        divs = response.xpath("//div").extract()
        yield {'divs': divs}
The divs I get are not from the actual forum thread, but from the cookie wall.
Here's the html of the button:
<a href="javascript:acceptCookies()" class="button acc CookiesOK" onclick="document.forms['cookies'].submit();acceptCookies();">Ja, Ik wil een goed werkende site...<span class="smaller">...en accepteer de cookies</span></a>
Can anyone point me in the right direction on how to bypass this cookiewall (artificially 'click' the button) and go to the actual webpage I'm trying to scrape? (Even the right Google search terms/documentation pages etc would be very helpful)
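For orientation: the button's onclick handler submits a form named 'cookies', so "clicking" it is really just an HTTP form submission whose response sets cookies that must be carried on later requests. A minimal stdlib sketch of that idea (the form action URL and field name here are assumptions for illustration, not read from the real site; a real script would take them from the form's HTML):

```python
import urllib.request
import urllib.parse
import http.cookiejar

# A cookie jar shared by all requests, so cookies set by the form
# submission are sent automatically on any later page fetches.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Reproduce the submit of the form named 'cookies'. The action URL and
# the field name below are placeholders -- read them from the real form.
form_data = urllib.parse.urlencode({"allowcookies": "ACCEPTEER ALLE COOKIES"}).encode()
accept = urllib.request.Request("http://forum.fok.nl/", data=form_data)

# A urllib Request with a data payload is sent as a POST.
print(accept.get_method())  # POST

# opener.open(accept) would perform the submission; after that, the same
# opener could fetch http://forum.fok.nl/topic/2413069 past the wall.
```

Scrapy offers the same pattern natively (`FormRequest.from_response` plus its default cookie middleware), which is the direction the answer below takes.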
Answer
In the end I found multiple ways to solve this problem:
- Simply appending /?token=77c1f767bc31859fee1ffe041343fa48&allowcookies=ACCEPTEER+ALLE+COOKIES to the start URL worked for this specific case.
- I later switched to a CrawlSpider instead of a normal Spider; then I could add the XPath of the cookie button as the first rule.
- Clicking the button using the earlier-mentioned Selenium also worked, but it is a lot of hassle that is not really necessary.
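The first workaround amounts to moving the start URL past the wall by hand. A sketch of building such a URL with the stdlib (the token value is the one quoted in the answer and is presumably session-specific, so a fresh one may need to be harvested first):

```python
from urllib.parse import urlencode

base = "http://forum.fok.nl/topic/2413069"

# Query parameters taken from the answer above; urlencode turns the
# spaces into '+', matching the URL quoted there.
params = {
    "token": "77c1f767bc31859fee1ffe041343fa48",
    "allowcookies": "ACCEPTEER ALLE COOKIES",
}
start_url = base + "/?" + urlencode(params)
print(start_url)
```

The resulting string can then be dropped into the spider's `start_urls` list in place of the bare forum URL.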