How to bypass a 'cookiewall' when using scrapy?


Question

I'm a new user to Scrapy. After following the tutorials for extracting data from websites, I am trying to accomplish something similar on forums.

What I want is to extract all posts on a forum page (to start with). However, this particular forum has a 'cookie wall'. So whenever I want to extract from http://forum.fok.nl/topic/2413069, each session I first need to click the "Yes, I accept cookies" button.

My very basic scraper currently looks like this:

import scrapy

class FokSpider(scrapy.Spider):
    name = 'fok'
    allowed_domains = ['forum.fok.nl']
    start_urls = ['http://forum.fok.nl/']

    def parse(self, response):
        divs = response.xpath("//div").extract()
        yield {'divs': divs}

The divs I get are not from the actual forum thread, but from the cookie wall.

Here's the html of the button:

<a href="javascript:acceptCookies()" class="button acc CookiesOK" onclick="document.forms['cookies'].submit();acceptCookies();">Ja, Ik wil een goed werkende site...<span class="smaller">...en accepteer de cookies</span></a>

Can anyone point me in the right direction on how to bypass this cookiewall (artificially 'click' the button) and go to the actual webpage I'm trying to scrape? (Even the right Google search terms/documentation pages etc would be very helpful)

Answer

In the end I found multiple ways to solve this problem:

  • Simply appending /?token=77c1f767bc31859fee1ffe041343fa48&allowcookies=ACCEPTEER+ALLE+COOKIES to the start URL worked for this specific case.
  • I later switched to a CrawlSpider instead of a plain Spider; that let me add the xpath of the cookie button as the first rule.
  • Clicking the button with the earlier-mentioned Selenium also worked, but is a lot of hassle that is not really necessary.
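
The first option can be sketched with the standard library; the token value is the one quoted above (presumably specific to this site), and urlencode produces the same '+' encoding for the spaces:

```python
from urllib.parse import urlencode

base = 'http://forum.fok.nl/topic/2413069'
# Token copied from the cookie-accept request; presumably site-specific.
params = {
    'token': '77c1f767bc31859fee1ffe041343fa48',
    'allowcookies': 'ACCEPTEER ALLE COOKIES',  # spaces become '+'
}
start_url = base + '/?' + urlencode(params)
print(start_url)
# http://forum.fok.nl/topic/2413069/?token=77c1f767bc31859fee1ffe041343fa48&allowcookies=ACCEPTEER+ALLE+COOKIES
```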
