Crawling dynamic content with scrapy


Problem description


I am trying to get the latest reviews from the Google Play store. I'm following this question on getting the latest reviews.

The method specified in the above link's answer works fine in the scrapy shell, but when I try it in my crawler it gets completely ignored.

Code snippet:

import re
import sys
import time
import urllib
import urlparse

from scrapy import Spider
from scrapy.spider import BaseSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

from play.items import PlayApp

class PlaySpider(CrawlSpider):
    name = "play"
    allowed_domains = ["play.google.com"]
    start_urls = [
            "https://play.google.com/store/apps"
        ]

    rules = (
        Rule(LxmlLinkExtractor(allow=('/store/apps$', )), callback='parseCategory',follow=True),
    )

    def parseCategory(self, response):
        """
            gets categories from the store home page and calls parseLinks for each category
        """
        #something here......
        yield Request(categoryapps, callback=self.parseLinks)

    def parseLinks(self, response):

        '''
        get all the links from the category page and then
        pass individual links to the parseApp function.
        '''
        #something here

        yield Request(link, callback=self.parseApp)

    def parseApp(self, response):

        '''
        parses apps page to get info about the app
        '''

        #application page parsing ......        

        frmdata = {"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'}
        url = "https://play.google.com/store/getreviews"
        yield FormRequest(url, callback=self.parse_data, formdata=frmdata)

        yield app

    def parse_data(self, response):
        # do stuff with data...
        print '\n\n---------------I am here------------------\n\n'

The function parse_data is never called. I asked about this on #scrapy IRC and in a few other places but got no help. Please help me with this.

This is the DEBUG output on the terminal:

DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=isoft.studios.ncert.ncertbooks)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=af.hindi.stories.booktwo)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.frozenex.latestnewsms)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.aqua.apps.english.hindi.dictionary)
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.merriamwebster)
2015-06-03 13:56:08+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=an.HindiTranslate)

So a POST request is indeed being sent, but the callback method is not called.

Solution

It seems you are not changing the id in the form data: every request posts the same hard-coded id.

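import json
import re
import time

from scrapy.http import FormRequest
from scrapy.selector import Selector

def parseApp(self, response):
    # Unique app links found on the page.
    apps = set(response.xpath('//a[@class="card-click-target"]/@href').extract())
    url = "https://play.google.com/store/getreviews"
    for app in apps:
        # Keep what follows "id="; str.strip() removes a set of characters, not a prefix.
        _id = app.split('id=')[-1]
        form_data = {"id": _id, "reviewType": '0', "reviewSortOrder": '0', "pageNum": '0'}
        time.sleep(5)  # crude throttle; this blocks the whole crawl while it waits
        yield FormRequest(url=url, formdata=form_data, callback=self.parse_data)

def parse_data(self, response):
    # The callback name must match the one passed to FormRequest above.
    # response.body is a str on Python 2, as in the question; on Python 3 use response.text.
    response_data = re.findall(r"\[\[.*", response.body)
    if response_data:
        try:
            text = json.loads(response_data[0] + ']')
            sel = Selector(text=text[0][2])
        except (ValueError, IndexError, TypeError):
            return
        # extract whatever you need with sel.xpath('YOUR_XPATH_HERE')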
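
Deriving the id from each link is the step that makes every request carry different form data. Below is a minimal sketch of that step using only the standard library (Python 3). The helper names `extract_app_id` and `build_form_data` and the `page` parameter are illustrative assumptions, not part of Scrapy or of the code above. It also shows why `str.strip` is an unreliable way to remove the URL prefix: it strips a set of characters from both ends, not a leading substring.

```python
from urllib.parse import parse_qs, urlparse


def extract_app_id(href):
    """Return the value of the `id` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get("id")
    return values[0] if values else None


def build_form_data(app_id, page=0):
    """Form fields for the getreviews POST, mirroring the dict used above.

    `page` is a hypothetical convenience parameter for this sketch.
    """
    return {
        "id": app_id,
        "reviewType": "0",
        "reviewSortOrder": "0",
        "pageNum": str(page),
    }


# One of the app links that appears in the question's debug log.
href = "/store/apps/details?id=isoft.studios.ncert.ncertbooks"
app_id = extract_app_id(href)
print(app_id)
print(build_form_data(app_id))

# str.strip() treats its argument as a set of characters, so it also eats
# leading and trailing characters of the id itself.
mangled = href.strip("/store/apps/details?id=")
print(mangled == app_id)
```

Parsing the query string instead of slicing the path also keeps working if the link gains extra parameters.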

After cleaning the data, a sample review looks something like this:

<div class="single-review">
    <a href="/store/people/details?id=106726831005267540508">
        <img class="author-image" alt="Lorence Gerona avatar image" src="https://lh3.googleusercontent.com/uFp_tsTJboUY7kue5XAsGA=w48-c-h48">
    </a>
    <div class="review-header" data-expand-target="" data-reviewid="gp:AOqpTOHnsExa_P6JFRJD6HF5h71fpY91tNaEODjtfiTu-zPFki9ZnYsNp1HEcGFpGEfu9xqwJL_j-03Tx0e9lw">
        <div class="review-info">
            <span class="author-name">
                <a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a>
            </span>
            <span class="review-date">3 June 2015</span>
            <a class="reviews-permalink" href="/store/apps/details?id=com.supercell.boombeach&amp;reviewId=Z3A6QU9xcFRPSG5zRXhhX1A2SkZSSkQ2SEY1aDcxZnBZOTF0TmFFT0RqdGZpVHUtelBGa2k5Wm5Zc05wMUhFY0dGcEdFZnU5eHF3Skxfai0wM1R4MGU5bHc" title="Link to this review"></a> <div class="review-source" style="display:none">

        </div>
        <div class="review-info-star-rating">
            <div class="tiny-star star-rating-non-editable-container" aria-label="Rated 5 stars out of five stars">
                <div class="current-rating" style="width: 100%;">

                </div>
            </div>
        </div>
    </div>
    <div class="rate-review-wrapper">
        <div class="play-button icon-button small rate-review" title="Spam" data-rating="SPAM">
            <div class="icon spam-flag"></div>
        </div>
        <div class="play-button icon-button small rate-review" title="Helpful" data-rating="HELPFUL">
            <div class="icon thumbs-up"></div>
        </div>
        <div class="play-button icon-button small rate-review" title="Unhelpful" data-rating="UNHELPFUL">
            <div class="icon thumbs-down"></div>
        </div>
</div>
</div>
<div class="review-body">
<span class="review-title">Team BOOM BEACH</span>
Amazing game I can defeat hammerman
<div class="review-link" style="display:none">
    <a class="id-no-nav play-button tiny" href="#" target="_blank">Full Review</a>
</div>
</div>
</div>
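
The cleaning step that leads to HTML like the sample above can be sketched without Scrapy. This is a rough, standard-library-only illustration (Python 3), not the answer's exact code. `SAMPLE_BODY` is a made-up response whose shape is only inferred from the answer (a non-JSON prefix, then a `[[...` array whose third element is the review HTML). `SpanText` and `extract_review_html` are hypothetical helpers, and `html.parser` stands in for Scrapy's `Selector`.

```python
import json
import re
from html.parser import HTMLParser

# Made-up body for illustration only; the real response shape is an assumption.
SAMPLE_BODY = (
    ")]}'\n\n"
    '[["ecr",1,"<div class=\\"single-review\\">'
    '<span class=\\"author-name\\">Lorence Gerona</span>'
    '<span class=\\"review-title\\">Team BOOM BEACH</span></div>",2]\n'
)


def extract_review_html(body):
    """Find the '[[...' line, close the outer array, and return the HTML part."""
    match = re.search(r"\[\[.*", body)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0) + "]")
        return payload[0][2]
    except (ValueError, IndexError, TypeError):
        return None


class SpanText(HTMLParser):
    """Collect the text of <span> elements, keyed by their class attribute."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._current = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()


html = extract_review_html(SAMPLE_BODY)
parser = SpanText()
parser.feed(html)
print(parser.fields)
```

Inside a spider, the same `extract_review_html` result could be passed to `Selector(text=...)` and queried with XPath instead of the hand-written parser.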
