How to use Selenium along with Scrapy to automate the process?

Question

I learned at some point that you need a web toolkit such as Selenium to automate the scraping.

How can I click the next button on the Google Play store so that I can scrape the reviews? This is for a college project only.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from selenium import webdriver
import time


class Product(scrapy.Item):
    title = scrapy.Field()


class FooSpider(CrawlSpider):
    name = 'foo'

    start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Chrome(executable_path="C:chrmchromedriver.exe")
        self.browser.implicitly_wait(60) # 

    def parse(self,response):
        self.browser.get(response.url)
        sites = response.xpath('//div[@class="single-review"]/div[@class="review-header"]')
        items = []
        for i in range(0,200):
            time.sleep(20)
            button = self.browser.find_element_by_xpath("/html/body/div[4]/div[6]/div[1]/div[2]/div[2]/div[1]/div[2]/button[1]/div[2]/div/div")
            button.click()
            self.browser.implicitly_wait(30)    
            for site in sites:
                item = Product()

                item['title'] = site.xpath('.//div[@class="review-info"]/span[@class="author-name"]/a/text()').extract()
                yield item

I have updated my code, but it keeps returning the same 40 items over and over. What's wrong with my for loop?

It seems that the updated page source is not being passed to the XPath query, which is why the same 40 items come back every time.
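The diagnosis above can be illustrated without a real browser. The sketch below is only an illustration under stated assumptions: `collect_reviews` and `FakeBrowser` are hypothetical names that do not appear in the original code, and the comma-separated strings stand in for HTML pages. The point it tries to show is that the source has to be read again after every click; parsing the original `response` once, outside the loop, yields the same items on every pass.

```python
def collect_reviews(get_source, click_next, extract, max_pages):
    """Gather items across pages, re-reading the page source after every click."""
    seen = set()
    results = []
    for _ in range(max_pages):
        source = get_source()        # fresh page source on every pass
        for item in extract(source):
            if item not in seen:     # skip anything already collected
                seen.add(item)
                results.append(item)
        if not click_next():         # stop when there is no further page
            break
    return results


class FakeBrowser:
    """Hypothetical stand-in for a Selenium driver: each click reveals the next page."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    @property
    def page_source(self):
        return self.pages[self.index]

    def click_next(self):
        if self.index + 1 < len(self.pages):
            self.index += 1
            return True
        return False


browser = FakeBrowser(["alice,bob", "bob,carol", "dave"])
names = collect_reviews(
    get_source=lambda: browser.page_source,
    click_next=browser.click_next,
    extract=lambda source: source.split(","),
    max_pages=200,
)
print(names)
```

With a real driver, `get_source` would presumably return `self.browser.page_source`, and `extract` would build a `Selector(text=source)` and run the review XPath on it.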

Recommended answer

I would do it like this:

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

class FooSpider(CrawlSpider):
    name = 'foo'
    allowed_domains = ['foo.com']
    start_urls = ['http://foo.com']

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Firefox()
        self.browser.implicitly_wait(60)  # wait up to 60 s when locating elements

    def parse_foo(self, response):
        self.browser.get(response.url)  # load the response URL in the browser
        button = self.browser.find_element(By.XPATH, "path")  # find the element to click
        button.click()  # click
        time.sleep(1)  # wait until the page is fully loaded
        source = self.browser.page_source  # get the source of the loaded page
        sel = Selector(text=source)  # create a Selector object
        data = sel.xpath('path/to/the/data')  # select the data
        ...

It's better not to wait for a fixed amount of time, though. Instead of time.sleep(1), you can use one of the approaches described here: http://www.obeythetestinggoat.com/how-to-get-selenium-to-wait-for-page-load-after-a-click.html
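As a rough idea of what waiting on a condition looks like, here is a minimal polling helper. It is a sketch rather than the linked article's code: `wait_for`, its parameters, and the `page_changed` demo function are assumptions made for illustration. The helper calls a condition repeatedly until it returns something truthy or a timeout expires.

```python
import time


def wait_for(condition, timeout=10.0, poll_interval=0.1):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Demo with a fake condition that only becomes true on the third check.
checks = {"count": 0}


def page_changed():
    checks["count"] += 1
    return checks["count"] >= 3


result = wait_for(page_changed, timeout=2.0, poll_interval=0.01)
print(result, checks["count"])
```

In the spider, one possible use is to store `old_source = self.browser.page_source` before `button.click()` and then call `wait_for(lambda: self.browser.page_source != old_source)` in place of `time.sleep(1)`. Selenium also ships its own explicit waits (`WebDriverWait` together with `expected_conditions`), which may be the better fit when the condition is about a specific element.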