How to scrape a website with Sucuri protection


Problem description

According to the Scrapy documentation I want to crawl and scrape data from several sites. My code works correctly with ordinary websites, but when I try to crawl a website protected by Sucuri I don't get any data back; the Sucuri firewall seems to prevent me from reaching the site's markup.

The target website is http://www.dwarozh.net/ and this is my spider snippet:

from scrapy import Spider
from scrapy.selector import Selector
import scrapy

from Stack.items import StackItem
from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser


class StackSpider(Spider):
    name = "stack"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield item

This is the response I get:

<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
<script>var s={},u,c,U,r,i,l=0,a,e=eval,w=String.fromCharCode,sucuri_cloudproxy_js='',S='cz0iMHNlYyIuc3Vic3RyKDAsMSkgKyAnNXlCMicuc3Vic3RyKDMsIDEpICsgJycgKycnKyIxIi5zbGljZSgwLDEpICsgJ2pQYycuY2hhckF0KDIpKyJmIiArICIiICsnbz1jJy5jaGFyQXQoMikrICcnICsgCiI0Ii5zbGljZSgwLDEpICsgJ0FvPzcnLnN1YnN0cigzLCAxKSArIjUiICsgU3RyaW5nLmZyb21DaGFyQ29kZSgxMDIpICsgIiIgKycxJyArICAgJycgKyAKIjFzZWMiLnN1YnN0cigwLDEpICsgICcnICsnJysnMycgKyAgImUiLnNsaWNlKDAsMSkgKyAiIiArImZzdSIuc2xpY2UoMCwxKSArICIiICsiMnN1Y3VyIi5jaGFyQXQoMCkrICcnICtTdHJpbmcuZnJvbUNoYXJDb2RlKDEwMCkgKyAgJycgKyI5c3UiLnNsaWNlKDAsMSkgKyAgJycgKycnKyI2IiArICdDYycuc2xpY2UoMSwyKSsiNnN1Ii5zbGljZSgwLDEpICsgJ2YnICsgICAnJyArIAonYScgKyAgIjAiICsgJ2YnICsgICI0IiArICI2c2VjIi5zdWJzdHIoMCwxKSArICAnJyArIAonWnBFMScuc3Vic3RyKDMsIDEpICsiMSIgKyBTdHJpbmcuZnJvbUNoYXJDb2RlKDB4MzgpICsgIiIgKyI1c3VjdXIiLmNoYXJBdCgwKSsiZnN1Ii5zbGljZSgwLDEpICsgJyc7ZG9jdW1lbnQuY29va2llPSdzc3VjJy5jaGFyQXQoMCkrICd1JysnJysnYycuY2hhckF0KDApKyd1c3VjdXInLmNoYXJBdCgwKSsgJ3JzdWMnLmNoYXJBdCgwKSsgJ3N1Y3VyaScuY2hhckF0KDUpICsgJ19zdScuY2hhckF0KDApICsnY3N1Y3VyJy5jaGFyQXQoMCkrICdsJysnbycrJ3UnLmNoYXJBdCgwKSsnZCcrJ3AnKycnKydyc3VjdScuY2hhckF0KDApICArJ3NvJy5jaGFyQXQoMSkrJ3gnKyd5JysnX3N1Y3VyaScuY2hhckF0KDApICsgJ3UnKyd1JysnaXN1Y3VyaScuY2hhckF0KDApICsgJ3N1Y3VkJy5jaGFyQXQoNCkrICdzXycuY2hhckF0KDEpKycxJysnOCcrJzEnKydzdWN1cmQnLmNoYXJBdCg1KSArICdlJy5jaGFyQXQoMCkrJzEnKydzdWN1cjEnLmNoYXJBdCg1KSArICcxc3VjdXJpJy5jaGFyQXQoMCkgKyAnMicrIj0iICsgcyArICc7cGF0aD0vO21heC1hZ2U9ODY0MDAnOyBsb2NhdGlvbi5yZWxvYWQoKTs=';L=S.length;U=0;r='';var A='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';for(u=0;u<64;u++){s[A.charAt(u)]=u;}for(i=0;i<L;i++){c=s[S.charAt(i)];U=(U<<6)+c;l+=6;while(l>=8){((a=(U>>>(l-=8))&0xff)||(i<(L-2)))&&(r+=w(a));}}e(r);</script></html>

How can I bypass Sucuri with Scrapy?

Recommended answer

The site uses cookie- and user-agent-based protection. You can check this yourself: open DevTools in Chrome, navigate to the target page http://www.dwarozh.net/sport/, then in the Network tab right-click the request for the page and choose "Copy as cURL". Open a terminal and run the cURL command:

$ curl 'http://www.dwarozh.net/sport/all-hawal.aspx?cor=3&Nawnishan=%D9%88%DB%95%D8%B1%D8%B2%D8%B4%DB%95%DA%A9%D8%A7%D9%86%DB%8C%20%D8%AF%DB%8C%DA%A9%DB%95' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2' -H 'Upgrade-Insecure-Requests: 1' -H 'X-Compress: null' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://www.dwarozh.net/sport/details.aspx?jimare=10505' -H 'Cookie: __cfduid=dc9867; sucuri_cloudproxy_uuid_ce28bca9c=d36ad9; ASP.NET_SessionId=wqdo0v; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c=6ab0; _gat=1; __asc=7c0b5; __auc=35; _ga=GA1.2.19688' -H 'Connection: keep-alive' --compressed

You will see the normal HTML. If you remove the cookies or the User-Agent from the request, you get the challenge page again.
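The challenge page shown above is not random noise: its script decodes a Base64 payload (the S='...' string), eval()s it, and the decoded JavaScript assembles a sucuri_cloudproxy_uuid_* cookie character by character, sets it via document.cookie and reloads the page. If you are curious what it does, here is a minimal decoding sketch (it assumes you saved the challenge response to a local file; challenge.html is just an example name):

import base64
import re

# Read the challenge page returned by the site (the HTML shown above).
challenge_html = open('challenge.html', encoding='utf-8').read()

# Pull out the Base64 blob assigned to S='...' and decode it.
match = re.search(r"S='([A-Za-z0-9+/=]+)'", challenge_html)
if match:
    # The decoded text is the JavaScript the page eval()s: it builds the
    # cookie value piece by piece, sets document.cookie and then calls
    # location.reload().
    print(base64.b64decode(match.group(1)).decode('utf-8', errors='replace'))

Decoding it is only useful for understanding the mechanism; the simpler route is to let a real browser pass the challenge and reuse its cookies, which is what the rest of this answer does.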

Let's check it in Scrapy:

$ scrapy shell
>>> from scrapy import Request
>>> cookie_str = '''here; your; cookies; from; browser; go;'''
>>> cookies = dict(pair.split('=') for pair in cookie_str.split('; '))
>>> cookies  # check them
{'__auc': '999', '__cfduid': '796', '_gat': '1', '__atuvc': '1%7C49', 'sucuri_cloudproxy_uuid_0d5c97a96': '6ab007eb19', 'ASP.NET_SessionId': 'u9', '_ga': 'GA1.2.1968.148', '__asc': 'sfsdf', 'sucuri_cloudproxy_uuid_ce2sfsdfs': 'sdfsdf'}
>>> r = Request(url='http://www.dwarozh.net/sport/', cookies=cookies, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/56 (KHTML, like Gecko) Chrome/54. Safari/5'})
>>> fetch(r)
>>> response.xpath('//div[@class="news-more-img"]/ul/li')
[<Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10507">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10505">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10504">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10503">'>, <Selector xpath='//div[@class="news-more-img"]/ul/li' data='<li><a href="details.aspx?jimare=10323">'>]
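One caveat about the dict(pair.split('=') for ...) one-liner: it raises a ValueError if any cookie value itself contains an '=' (padded Base64-looking values sometimes do). Splitting only on the first '=' is slightly more robust:

>>> cookies = dict(pair.split('=', 1) for pair in cookie_str.split('; '))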

Excellent! Let's make a spider:

I've modified your code because I don't have the source of some of its components.

from scrapy import Spider, Request
from scrapy.selector import Selector
import scrapy

#from Stack.items import StackItem
#from bs4 import BeautifulSoup
from scrapy import log
from scrapy.utils.response import open_in_browser


class StackSpider(Spider):
    name = "dwarozh"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]
    _cookie_str = '''__cfduid=dc986; sucuri_cloudproxy_uuid_ce=d36a; ASP.NET_SessionId=wq; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c97a96=6a; _gat=1; __asc=7c0b; __auc=3; _ga=GA1.2.196.14'''
    _user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/5 (KHTML, like Gecko) Chrome/54 Safari/5'

    def start_requests(self):
        cookies = dict(pair.split('=') for pair in self._cookie_str.split('; '))
        return [Request(url=url, cookies=cookies, headers={'User-Agent': self._user_agent})
                for url in self.start_urls]

    def parse(self, response):
        mItems = Selector(response).xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = {}  # StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            item['url'] = mItem.xpath('viewa/@href').extract_first()
            yield {'url': item['url'], 'title': item['title']}

Let it run:

$ scrapy crawl dwarozh -o - -t csv --loglevel=DEBUG
/Users/el/Projects/scrap_woman/.env/lib/python3.4/importlib/_bootstrap.py:321: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
  return f(*args, **kwds)
2016-12-10 00:18:55 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrap1)
2016-12-10 00:18:55 [scrapy] INFO: Overridden settings: {'SPIDER_MODULES': ['scrap1.spiders'], 'FEED_FORMAT': 'csv', 'BOT_NAME': 'scrap1', 'FEED_URI': 'stdout:', 'NEWSPIDER_MODULE': 'scrap1.spiders', 'ROBOTSTXT_OBEY': True}
2016-12-10 00:18:55 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-10 00:18:55 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-10 00:18:55 [scrapy] INFO: Spider opened
2016-12-10 00:18:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-10 00:18:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-12-10 00:18:55 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/robots.txt> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/sport/> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nلیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nهەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nگرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nبەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nكچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە'}
2016-12-10 00:18:56 [scrapy] INFO: Closing spider (finished)
2016-12-10 00:18:56 [scrapy] INFO: Stored csv feed (5 items) in: stdout:
2016-12-10 00:18:56 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 950,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 15121,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 9, 21, 18, 56, 271371),
 'item_scraped_count': 5,
 'log_count/DEBUG': 8,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2016, 12, 9, 21, 18, 55, 869851)}
2016-12-10 00:18:56 [scrapy] INFO: Spider closed (finished)
url,title
,"
لیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە"
,"
هەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید"
,"
گرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا"
,"
بەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە"
,"
كچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە"
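Note that the url column comes out empty: the spider keeps the original viewa/@href XPath, which matches nothing. Judging from the selector data in the shell session above (<li><a href="details.aspx?jimare=...">), the link most likely sits on the <a> element itself, so something along these lines should fill it in (a/@href is an assumption based on that snippet, not verified against the full markup):

# inside parse(): the href is relative ("details.aspx?jimare=..."),
# so join it against the page URL
item['url'] = response.urljoin(mItem.xpath('a/@href').extract_first())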

You will probably have to refresh the cookies from time to time. You can use PhantomJS for that.

Update

How to get the cookies with PhantomJS:


  1. Install PhantomJS.

  2. Make a script like this, dwarosh.js:

var page = require('webpage').create();
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.dwarozh.net/sport/', function(status) {
  console.log("Status: " + status);
  if(status === "success") {
    page.render('example.png');
    page.evaluate(function() {
    return document.title;
  });
  }
  for (var i=0; i<page.cookies.length; i++) {
    var c = page.cookies[i];
    console.log(c.name, c.value);
  };
  phantom.exit();
});


  3. Run the script:

      $ phantomjs --cookies-file=cookie.txt dwarosh.js
      TypeError: undefined is not an object (evaluating  'activeElement.position().left')
    
      http://www.dwarozh.net/sport/js/script.js:5
      https://code.jquery.com/jquery-1.10.2.min.js:4 in c
      https://code.jquery.com/jquery-1.10.2.min.js:4 in fireWith
      https://code.jquery.com/jquery-1.10.2.min.js:4 in ready
      https://code.jquery.com/jquery-1.10.2.min.js:4 in q
    Status: success
    __auc 250ab0a9158ee9e73eeeac78bba
    __asc 250ab0a9158ee9e73eeeac78bba
    _gat 1
    _ga GA1.2.260482211.1481472111
    ASP.NET_SessionId vs1utb1nyblqkxprxgazh0g2
    sucuri_cloudproxy_uuid_3e07984e4 26e4ab3...
    __cfduid d9059962a4c12e0f....1
    


  4. Get the cookie sucuri_cloudproxy_uuid_3e07984e4 and try to get the page with curl and the same User-Agent:

    $ curl -v http://www.dwarozh.net/sport/ -b sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465 -A SpecialAgent
    *   Trying 104.25.209.23...
    * Connected to www.dwarozh.net (104.25.209.23) port 80 (#0)
    > GET /sport/ HTTP/1.1
    > Host: www.dwarozh.net
    > User-Agent: SpecialAgent
    > Accept: */*
    > Cookie:     sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465
    >
    < HTTP/1.1 200 OK
    < Date: Sun, 11 Dec 2016 16:17:04 GMT
    < Content-Type: text/html; charset=utf-8
    < Transfer-Encoding: chunked
    < Connection: keep-alive
    < Set-Cookie: __cfduid=d1646515f5ba28212d4e4ca562e2966311481473024; expires=Mon, 11-Dec-17 16:17:04 GMT; path=/; domain=.dwarozh.net; HttpOnly
    < Cache-Control: private
    < Vary: Accept-Encoding
    < Set-Cookie: ASP.NET_SessionId=srxyurlfpzxaxn1ufr0dvxc2; path=/; HttpOnly
    < X-AspNet-Version: 4.0.30319
    < X-XSS-Protection: 1; mode=block
    < X-Frame-Options: SAMEORIGIN
    < X-Content-Type-Options: nosniff
    < X-Sucuri-ID: 15008
    < Server: cloudflare-nginx
    < CF-RAY: 30fa3ea1335237b0-ARN
    <
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>
    Dwarozh : Sport
    </title><meta content="دواڕۆژ سپۆرت هەواڵی ناوخۆ،هەواڵی جیهانی، وەرزشەکانی دیکە" name="description"/><meta property="fb:app_id" content="1713056075578566"/><meta content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=no" name="viewport"/><link href="wene/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="wene/style.css" rel="stylesheet" type="text/css"/>
    <script src="js/jquery-2.1.1.js" type="text/javascript"></script>
    <script src="https://code.jquery.com/jquery-1.10.2.min.js" type="text/javascript"></script>
    <script src="js/script.js" type="text/javascript"></script>
    <link href="css/styles.css" rel="stylesheet"/>
    <script src="js/classie.js" type="text/javascript"></script>
    <script type="text/javascript">
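
To avoid copying cookies from the browser by hand every time they expire, you can also let the spider run the PhantomJS script itself and reuse whatever cookies it prints. A rough sketch, assuming phantomjs is on the PATH, dwarosh.js from the update above sits in the working directory, and its output looks like the run shown earlier (name value pairs, one per line, mixed with status and JS error lines); the spider name here is only illustrative:

import subprocess

from scrapy import Spider, Request


class DwarozhPhantomSpider(Spider):
    name = "dwarozh_phantom"
    start_urls = ["http://www.dwarozh.net/sport/"]
    # Must match page.settings.userAgent in dwarosh.js, otherwise the
    # sucuri_cloudproxy cookie will not be accepted by the firewall.
    user_agent = "SpecialAgent"

    def _fresh_cookies(self):
        # Run the PhantomJS script and capture everything it prints.
        out = subprocess.check_output(
            ["phantomjs", "--cookies-file=cookie.txt", "dwarosh.js"]
        ).decode("utf-8")
        cookies = {}
        for raw in out.splitlines():
            line = raw.strip()
            parts = line.split(None, 1)
            # Keep only plausible "name value" pairs; skip status and error lines.
            if len(parts) == 2 and not line.startswith(("Status", "TypeError", "http")):
                cookies[parts[0]] = parts[1]
        return cookies

    def start_requests(self):
        cookies = self._fresh_cookies()
        for url in self.start_urls:
            yield Request(url, cookies=cookies,
                          headers={"User-Agent": self.user_agent})

The parse() method from the spider above can be reused unchanged; the only difference is where the cookies come from.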
    

