Scraping a website using Scrapy and Selenium
Problem description
I am going to scrape the HTML content of http://ntry.com/#/scores/named_ladder/main.php with Scrapy. But because the site relies on JavaScript and a fragment (#) URL, I guess I also have to use Selenium (with Python).

I'd like to write my own code, but I am new to programming, so I think I need help. I want to visit ntry.com first, and then move to http://ntry.com/#/scores/named_ladder/main.php by clicking this anchor:
<body>
<div id="wrap">
<div id="container">
<div id="content">
<a href="/scores/named_ladder/main.php">사다리</a>
</div>
</div>
</div>
</body>
and then I want to scrape the HTML of the changed page using Scrapy.

How can I make a Selenium-blended Scrapy spider?
I installed Selenium, loaded the PhantomJS module, and it worked perfectly. Here is what you can try:
from scrapy import Spider
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

class FormSpider(Spider):
    name = "form"

    def __init__(self):
        # Spoof a desktop browser user agent so the site serves its normal markup
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/38.0.2125.122 Safari/537.36")
        self.driver = webdriver.PhantomJS(
            desired_capabilities=dcap,
            service_args=['--ignore-ssl-errors=true',
                          '--ssl-protocol=any',
                          '--web-security=false'])
        self.driver.set_window_size(1366, 768)

    def parse_page(self, response):
        # Let PhantomJS render the JavaScript-driven page, then grab its cookies
        self.driver.get(response.url)
        cookies_list = self.driver.get_cookies()
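After `self.driver.get(...)` has run, `self.driver.page_source` holds the post-JavaScript HTML, which you can hand to Scrapy's selectors (or any parser). As a minimal stand-in that runs without a browser or Scrapy installed, here is a standard-library sketch of the extraction step, fed with the snippet from the question. The `AnchorFinder` helper is hypothetical, not part of Scrapy or Selenium; it just mimics what a `response.css('a')` extraction would give you:

```python
from html.parser import HTMLParser

class AnchorFinder(HTMLParser):
    """Collect (href, text) pairs for every <a> tag, mimicking the
    anchor extraction a Scrapy spider would do with selectors."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Pair the anchor's visible text with the href captured above
        if self._current_href is not None and data.strip():
            self.links.append((self._current_href, data.strip()))
            self._current_href = None

# The rendered markup from the question:
page_source = """
<body>
  <div id="wrap"><div id="container"><div id="content">
    <a href="/scores/named_ladder/main.php">사다리</a>
  </div></div></div>
</body>
"""

finder = AnchorFinder()
finder.feed(page_source)
print(finder.links)  # [('/scores/named_ladder/main.php', '사다리')]
```

In a real spider you would pass `self.driver.page_source` to `scrapy.Selector(text=...)` instead, but the idea is the same: Selenium renders, Scrapy (or any parser) extracts.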