selenium with scrapy for dynamic page


Question

I'm trying to scrape product information from a webpage using scrapy. My to-be-scraped webpage looks like this:

  • starts with a product_list page with 10 products
  • a click on the "next" button loads the next 10 products (the URL doesn't change between the two pages)
  • I use LinkExtractor to follow each product link into the product page and get all the information I need

I tried to replicate the next-button AJAX call but couldn't get it working, so I'm giving selenium a try. I can run selenium's webdriver in a separate script, but I don't know how to integrate it with scrapy. Where should I put the selenium part in my scrapy spider?
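For context on the "replicate the AJAX call" attempt: the usual approach is to open the browser's developer tools (Network tab), find the XHR the "next" button fires, and rebuild that request by hand. A minimal stdlib sketch of the idea, where the endpoint, parameters, and headers are all hypothetical placeholders, not taken from the actual site:

```python
from urllib import parse, request

# Hypothetical endpoint and parameters -- inspect the real XHR in the
# browser's Network tab to find the actual values for your site.
endpoint = "http://example.com/shanghai/products"
payload = parse.urlencode({"page": 2, "pageSize": 10}).encode("utf-8")

req = request.Request(
    endpoint,
    data=payload,  # supplying a body makes urllib send a POST
    headers={"X-Requested-With": "XMLHttpRequest"},  # many endpoints check this
)

# Nothing is sent here; we only build the request object to show its shape.
# To actually fire it: request.urlopen(req)
print(req.get_method())    # POST
print(req.get_full_url())
```

If the endpoint returns JSON, the response can be fed to a normal scrapy callback and no browser is needed at all, which is much faster than selenium when it works.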

My spider is pretty standard, like the following:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.log import INFO
from scrapy.selector import HtmlXPathSelector


class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
             callback='parse_product'),
    ]

    def parse_product(self, response):
        self.log("parsing product %s" % response.url, level=INFO)
        hxs = HtmlXPathSelector(response)
        # actual data follows

Any idea is appreciated. Thank you!

Answer

It really depends on how you need to scrape the site, and how and what data you want to get.

Here's an example of how you can follow pagination on ebay using Scrapy+Selenium:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # raises NoSuchElementException on the last page,
                # where there is no "next" link
                next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                break

        self.driver.close()
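At the "get the data" step, the rendered DOM is available as self.driver.page_source and can be handed to any HTML parser (scrapy's Selector included). A stdlib-only sketch of that extraction step, using an invented product-list snippet in place of the real rendered page -- the markup and class names are illustrative, not ebay's:

```python
from html.parser import HTMLParser

# Stand-in for self.driver.page_source after the click;
# this markup is made up for illustration.
rendered = """
<ul>
  <li class="product"><a href="/p/1">Python Cookbook</a></li>
  <li class="product"><a href="/p/2">Fluent Python</a></li>
</ul>
"""

class ProductTitleParser(HTMLParser):
    """Collects link text inside <li class="product"> elements."""
    def __init__(self):
        super().__init__()
        self.in_product = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "product") in attrs:
            self.in_product = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False

    def handle_data(self, data):
        if self.in_product and data.strip():
            self.titles.append(data.strip())

parser = ProductTitleParser()
parser.feed(rendered)
print(parser.titles)  # ['Python Cookbook', 'Fluent Python']
```

In a real spider, each extracted title would be yielded as a scrapy item from parse() rather than printed.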

Here are some examples of "selenium spiders":

There is also an alternative to having to use Selenium with Scrapy. In some cases, using ScrapyJS middleware is enough to handle the dynamic parts of a page. Sample real-world usage:
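To illustrate the ScrapyJS route (the project was later renamed scrapy-splash): it renders pages through a Splash server and is wired up via project settings. A sketch of the settings.py fragment, based on the scrapyjs README of that era -- the middleware path, priority, and default Splash port should be checked against the version you install:

```
# settings.py -- enable ScrapyJS / Splash rendering (values assumed from
# the scrapyjs README; verify against your installed version)
SPLASH_URL = 'http://localhost:8050'      # where the Splash server listens

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

# makes the dupefilter aware of Splash-wrapped request URLs
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
```

Individual requests then opt in to rendering via request meta, e.g. meta={'splash': {'endpoint': 'render.html'}}, and the callback receives the JavaScript-rendered HTML as an ordinary response.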

