Scraping a website using Scrapy and selenium


Problem description


I am going to scrape the HTML contents of http://ntry.com/#/scores/named_ladder/main.php with Scrapy.

But because the site uses JavaScript and a # fragment URL, I guess I have to use Selenium (Python) as well.

I'd like to write my own code, but I am new to programming so I guess I need help;

I want to enter ntry.com first, and move to http://ntry.com/#/scores/named_ladder/main.php by clicking this anchor:

<body>
    <div id="wrap">
        <div id="container">
            <div id="content">
                <a href="/scores/named_ladder/main.php">사다리</a>
            </div>
        </div>
    </div>
</body>

and then I want to scrape the HTML on the changed page using Scrapy.

How can I make a selenium-blended Scrapy spider?

Solution

I installed Selenium and loaded the PhantomJS driver, and it worked perfectly.

Here is what you can try:

from scrapy import Spider
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

class FormSpider(Spider):
    name = "form"

    def __init__(self):
        # Spoof a desktop browser user agent so the site serves its
        # normal pages to the headless browser.
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/38.0.2125.122 Safari/537.36"
        )

        # Headless PhantomJS driver; relax SSL checks so the page loads.
        self.driver = webdriver.PhantomJS(
            desired_capabilities=dcap,
            service_args=[
                '--ignore-ssl-errors=true',
                '--ssl-protocol=any',
                '--web-security=false',
            ],
        )
        self.driver.set_window_size(1366, 768)

    def parse_page(self, response):
        # Let Selenium render the JavaScript-driven page, then read
        # the session cookies it accumulated.
        self.driver.get(response.url)
        cookies_list = self.driver.get_cookies()
