Scraping with Scrapy and Selenium


Question

I have a scrapy spider which crawls a site that reloads content via javascript on the page. In order to move to the next page to scrape, I have been using Selenium to click on the month link at the top of the site.

The problem is that, even though my code moves through each link as expected, the spider only scrapes the first month's (September) data, once for each month, and returns this duplicated data.

How can I get around this?

import time

from selenium import webdriver
from scrapy.contrib.spiders.init import InitSpider
from scrapy.selector import HtmlXPathSelector

# GigsInScotlandMainItem is the project's Item subclass, imported from
# the project's items module.


class GigsInScotlandMain(InitSpider):
    name = 'gigsinscotlandmain'
    allowed_domains = ["gigsinscotland.com"]
    start_urls = ["http://www.gigsinscotland.com"]

    def __init__(self):
        InitSpider.__init__(self)
        # A real Firefox instance driven by Selenium, used to execute the
        # page's javascript and click the month links.
        self.br = webdriver.Firefox()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        self.br.get(response.url)
        time.sleep(2.5)
        # Get the link text for each month listed at the top of the page.
        months = hxs.select("//ul[@id='gigsMonths']/li/a/text()").extract()

        for month in months:
            link = self.br.find_element_by_link_text(month)
            link.click()
            time.sleep(5)

            # Get all the divs containing info to be scraped.
            # BUG: `hxs` still wraps the initial response, so this returns
            # the same (first month's) listings on every iteration.
            listitems = hxs.select("//div[@class='listItem']")
            for listitem in listitems:
                item = GigsInScotlandMainItem()
                item['artist'] = listitem.select("div[contains(@class, 'artistBlock')]/div[@class='artistdiv']/span[@class='artistname']/a/text()").extract()
                #
                # Get other data ...
                #
                yield item

Answer

The problem is that you are reusing the HtmlXPathSelector that was defined for the initial response. Redefine it from the Selenium browser's page_source:

...
for month in months:
    link = self.br.find_element_by_link_text(month)
    link.click()
    time.sleep(5)

    # Rebuild the selector from the browser's *current* DOM (note text=).
    hxs = HtmlXPathSelector(text=self.br.page_source)

    # Get all the divs containing info to be scraped.
    listitems = hxs.select("//div[@class='listItem']")
...
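The pitfall is general: a selector (or parsed tree) is a snapshot of the HTML it was built from, so after the browser navigates you must re-parse. A minimal stdlib sketch of the same situation, using xml.etree.ElementTree in place of Scrapy's selector and made-up HTML for two months:

```python
import xml.etree.ElementTree as ET

# Hypothetical page sources before and after "clicking" a month link.
page_sept = "<html><body><div class='listItem'>Sept gig</div></body></html>"
page_oct = ("<html><body><div class='listItem'>Oct gig A</div>"
            "<div class='listItem'>Oct gig B</div></body></html>")

# Parse the initial page, as the spider's `hxs` does.
tree = ET.fromstring(page_sept)

# Simulate clicking the "October" link: the browser now holds new HTML,
# but `tree` still wraps the September markup, so the old items come back.
stale = [d.text for d in tree.findall(".//div[@class='listItem']")]

# Re-parse from the current page source, as the answer recommends.
tree = ET.fromstring(page_oct)
fresh = [d.text for d in tree.findall(".//div[@class='listItem']")]

print(stale)  # ['Sept gig']
print(fresh)  # ['Oct gig A', 'Oct gig B']
```

The `time.sleep(5)` after each click is a crude wait; in the real spider the same re-parse must happen only after the javascript has finished updating the page.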
