selenium with scrapy for dynamic page


Problem description

I'm trying to scrape product information from a webpage, using scrapy. My to-be-scraped webpage looks like this:

  • start from a product_list page with 10 products
  • a click on the "next" button loads the next 10 products (the url doesn't change between the two pages)
  • I use LinkExtractor to follow each product link into the product page, and get all the information I need

I tried to replicate the next-button ajax call but can't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Where should the Selenium part go in my Scrapy spider?

My spider is pretty standard, like the following:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.log import INFO

class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
             callback='parse_product'),
    ]

    def parse_product(self, response):
        self.log("parsing product %s" % response.url, level=INFO)
        hxs = HtmlXPathSelector(response)
        # actual data follows

Any idea is appreciated. Thank you!

Recommended answer

It really depends on how you need to scrape the site and what data you want to get.

Here's an example of how you can follow pagination on ebay using Scrapy+Selenium:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # locate and follow the "next page" link
                next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                # no "next" link left -- we're on the last page
                break

        self.driver.close()

Here are some examples of "selenium spiders":

  • Executing Javascript Submit form functions using scrapy in python
  • https://gist.github.com/cheekybastard/4944914
  • https://gist.github.com/irfani/1045108
  • http://snipplr.com/view/66998/

There is also an alternative to having to use Selenium with Scrapy. In some cases, using the ScrapyJS middleware is enough to handle the dynamic parts of a page.
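ScrapyJS has since evolved into the scrapy-splash project; the wiring looks something like the settings fragment below. This is a sketch that assumes a Splash rendering service is running locally (e.g. via `docker run -p 8050:8050 scrapinghub/splash`):

```python
# settings.py -- minimal scrapy-splash configuration (assumes a local Splash instance)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With this in place, the spider issues `SplashRequest` objects (from `scrapy_splash`) instead of plain `scrapy.Request`, and the response contains the JavaScript-rendered page.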
