Executing JavaScript submit form functions using Scrapy in Python


Question

I am scraping a site using the Scrapy framework and am having trouble clicking a JavaScript link that opens another page.

I can identify the code on the page as:

<a class="Page" alt="Click to view job description" title="Click to view job description" href="javascript:sysSubmitForm('frmSR1');">Accountant&nbsp;</a>

Can anyone suggest how to execute that JavaScript in Scrapy and get the other page, so that I can fetch data from it?

Thanks in advance.

Answer

Check out the snippet below on how to use Scrapy with Selenium. Crawling will be slower because you aren't just downloading the HTML, but you get full access to the DOM.

Note: I have copy-pasted this snippet as the links previously provided no longer work.

# Snippet imported from snippets.scrapy.org (which no longer works).
# Note: this targets the legacy Scrapy (scrapy.contrib) and Selenium RC APIs.

import time

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item

from selenium import selenium

class SeleniumSpider(CrawlSpider):
    name = "SeleniumSpider"
    start_urls = ["http://www.domain.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\.html', )),
             callback='parse_page', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        # Requires a Selenium RC server listening on localhost:4444
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()
        print self.verificationErrors

    def parse_page(self, response):
        item = Item()

        # Do some XPath selection with Scrapy
        hxs = HtmlXPathSelector(response)
        hxs.select('//div').extract()

        # Open the same URL in the Selenium-controlled browser
        sel = self.selenium
        sel.open(response.url)

        # Wait for JavaScript to load in Selenium
        time.sleep(2.5)

        # Do some crawling of JavaScript-created content with Selenium
        sel.get_text("//div")
        yield item
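
If the JavaScript link does nothing more than submit a form that is already in the HTML (here sysSubmitForm('frmSR1') presumably posts the form named frmSR1), a lighter-weight option is to reproduce that submission with Scrapy's FormRequest and skip the browser entirely. Below is a minimal sketch using current Scrapy import paths; the spider name, start URL, and extracted fields are placeholders, and it assumes the form's action and hidden fields can be read straight from the page:

from scrapy.spiders import Spider
from scrapy.http import FormRequest

class JobSpider(Spider):
    name = "jobs"  # hypothetical spider name
    start_urls = ["http://www.domain.com/search-results"]  # placeholder URL

    def parse(self, response):
        # Reproduce what sysSubmitForm('frmSR1') does in the browser:
        # submit the form named frmSR1 back to the server.
        yield FormRequest.from_response(
            response,
            formname="frmSR1",
            callback=self.parse_job,
        )

    def parse_job(self, response):
        # Response for the page the form submission leads to,
        # e.g. the job description.
        yield {"description": response.xpath("//div//text()").extract()}

This only works when the JavaScript simply submits the form; if it builds the request dynamically, the Selenium approach above is the safer choice.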

