How to use selenium along with scrapy to automate the process?
Question
At some point I learned that you need to use web toolkits like Selenium to automate the scraping.

How can I click the "Next" button on the Google Play store so I can scrape the reviews, for my college purposes only?
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from selenium import webdriver
import time


class Product(scrapy.Item):
    title = scrapy.Field()


class FooSpider(CrawlSpider):
    name = 'foo'
    start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Chrome(executable_path="C:\chrm\chromedriver.exe")
        self.browser.implicitly_wait(60)

    def parse(self, response):
        self.browser.get(response.url)
        sites = response.xpath('//div[@class="single-review"]/div[@class="review-header"]')
        items = []

        for i in range(0, 200):
            time.sleep(20)
            button = self.browser.find_element_by_xpath("/html/body/div[4]/div[6]/div[1]/div[2]/div[2]/div[1]/div[2]/button[1]/div[2]/div/div")
            button.click()
            self.browser.implicitly_wait(30)
            for site in sites:
                item = Product()
                item['title'] = site.xpath('.//div[@class="review-info"]/span[@class="author-name"]/a/text()').extract()
                yield item
I have updated my code, and it keeps giving me the same 40 items again and again. What's wrong with my for loop?

It seems the updated page source is not being passed to the XPath, which is why it returns the same 40 items.
Answer
I would do something like this:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from selenium import webdriver
import time


class FooSpider(CrawlSpider):
    name = 'foo'
    allowed_domains = ['foo.com']
    start_urls = ['http://foo.com']

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        self.download_delay = 0.25
        self.browser = webdriver.Firefox()
        self.browser.implicitly_wait(60)

    def parse_foo(self, response):
        self.browser.get(response.url)  # load the response URL in the browser

        button = self.browser.find_element_by_xpath("path")  # find
        # the element to click on
        button.click()  # click it
        time.sleep(1)  # wait until the page is fully loaded

        source = self.browser.page_source  # get the source of the loaded page
        sel = Selector(text=source)  # create a fresh Selector object from it
        data = sel.xpath('path/to/the/data')  # select the data
        ...
It's better not to wait for a fixed amount of time, though. So instead of time.sleep(1), you can use one of the approaches described at http://www.obeythetestinggoat.com/how-to-get-selenium-to-wait-for-page-load-after-a-click.html.