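Crawling dynamic content with scrapy
Problem description
I am trying to get the latest reviews from the Google Play store. I am following an existing question on how to fetch the latest reviews.
The method given in that question's answer works in the scrapy shell, but when I try it in my crawler it is completely ignored.
Code snippet: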
    import re
    import sys
    import time
    import urllib
    import urlparse

    from scrapy import Spider
    from scrapy.spider import BaseSpider
    from scrapy.http import Request, FormRequest
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

    from play.items import PlayApp


    class PlaySpider(CrawlSpider):
        name = "play"
        allowed_domains = ["play.google.com"]
        start_urls = ["https://play.google.com/store/apps"]
        rules = (
            Rule(LxmlLinkExtractor(allow=('/store/apps$', )), callback='parseCategory', follow=True),
        )

        def parseCategory(self, response):
            """Get the categories from the store's main page; call parseLinks for each category."""
            # stuff here
            ......
            yield Request(categoryapps, callback=self.parseLinks)

        def parseLinks(self, response):
            '''Get all links from a category page, then pass each link to the parseApp function.'''
            # stuff here
            yield Request(link, callback=self.parseApp)

        def parseApp(self, response):
            '''Parse the app page to get information about the app.'''
            # app page parsing
            ......
            frmdata = {"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum": '0'}
            url = "https://play.google.com/store/getreviews"
            yield FormRequest(url, callback=self.parse_data, formdata=frmdata)
            yield app

        def parse_data(self, response):
            # process the data
            ...
            print '\n\n---------------I am here-----------\n\n'
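The function parse_data is never called. I have asked about this on the #scrapy IRC channel and in a few other places, but got no help. Please help me solve this.
Here is the debug output on the terminal: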
    DEBUG: Crawled (200) (referer: https://play.google.com/store/apps/details?id=isoft.studios.ncert.ncertbooks)
    2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=af.hindi.stories.booktwo)
    2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.frozenex.latestnewsms)
    2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.aqua.apps.english.hindi.dictionary)
    2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.merriamwebster)
    2015-06-03 13:56:08+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=an.HindiTranslate)
So the POST requests are indeed being sent, but the callback method is not called.
It looks like you are not changing the id in the form data.
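    import json
    import re
    from time import sleep

    from scrapy.http import FormRequest
    from scrapy.selector import Selector

    def parseApp(self, response):
        # Collect the unique app-detail links on the page.
        apps = list(set(response.xpath('//a[@class="card-click-target"]/@href').extract()))
        url = "https://play.google.com/store/getreviews"
        for app in apps:
            # Remove the path prefix to get the package name. replace() is used
            # because str.strip() removes a set of characters, not a prefix.
            _id = app.replace('/store/apps/details?id=', '')
            form_data = {"id": _id, "reviewType": '0', "reviewSortOrder": '0', "pageNum": '0'}
            sleep(5)  # blocks the crawler; the DOWNLOAD_DELAY setting is the non-blocking alternative
            yield FormRequest(url=url, formdata=form_data, callback=self.parse_data)

    # Named parse_data so that it matches the callback passed to FormRequest above.
    def parse_data(self, response):
        # The JSON payload starts at the first "[[" in the response body.
        response_data = re.findall(r"\[\[.*", response.body)
        if response_data:
            try:
                text = json.loads(response_data[0] + ']')
                sel = Selector(text=text[0][2])
            except (ValueError, IndexError):
                return
            # use sel.xpath('YOUR_XPATH_HERE') to extract whatever you want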
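The id sent in the form data has to be the package name of each app, which appears as the id query parameter of the app's detail link. As a minimal sketch (the helper name package_id is hypothetical, not part of Scrapy or of the original answer), the value can be read by parsing the query string. Note that str.strip takes a set of characters rather than a prefix, so it is not a safe way to cut /store/apps/details?id= off a link; the last line shows what it does to one of the links from the log.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2


def package_id(href):
    """Return the value of the `id` query parameter, or None if it is absent.

    `package_id` is a hypothetical helper name used only for this sketch.
    """
    values = parse_qs(urlparse(href).query).get("id")
    return values[0] if values else None


href = "/store/apps/details?id=an.HindiTranslate"

print(package_id(href))                        # an.HindiTranslate
print(href.strip("/store/apps/details?id="))   # n.HindiTran (characters eaten from both ends)
```

Inside the spider, each extracted href could be passed through such a helper before building the form data, skipping links for which it returns None.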
After cleaning up the data, a sample review looks something like this:
    <a href="/store/people/details?id=106726831005267540508"><img class="author-image" alt="Lorence Gerona avatar" src="https://lh3.googleusercontent.com/uFp_tsTJboUY7kue5XAsGA=w48-c-h48"></a>
    <div class="review-header" data-expand-target="" data-reviewid="gp:AOqpTOHnsExa_P6JFRJD6HF5h71fpY91tNaEODjtfiTu-zPFki9ZnYsNp1HEcGFpGEfu9xqwJL_j-03Tx0e9
    <div class="review-info">
    <span class="author-name"><a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a></span>
    <span class="review-date">June 3, 2015</span>
    <a class="reviews-permalink" href="/store/apps/details?id=com.supercell.boombeach&amp;amp;reviewId=Z3A6QU9xcFRPSG5zRXhhX1A2SkZSSkQ2SEY1aDcxZnBZOTF0TmFFT0RqdGZpVHUtelBGa2k5Wm5Zc05wMUhFY0dGcEdFZnU5eHF3Skxfai0wM1R4MGU5bHc" title="Link to this review"></a>
    <div class="review-source" style="display:none">
    <div class="review-info-star-rating">
    <div class="tiny-star star-rating-non-editable-container" aria-label="Rated 5 stars out of five stars">
    <div class="current-rating" style="width: 100%;">
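The cleaning step can be tried without running a spider. The sketch below uses only the standard library and mirrors the logic of the answer's callback: find the first line beginning with "[[", append the closing bracket, and take element [0][2] as the HTML fragment. The function name review_html and the sample body are made up for illustration; the real response layout is an assumption inferred from the answer's code, not something confirmed here.

```python
import json
import re


def review_html(body):
    """Return the HTML fragment embedded in a getreviews-style body, or None.

    Hypothetical helper: the payload shape (a nested list whose [0][2]
    element is HTML) is assumed from the answer's code.
    """
    # "." does not match a newline, so this captures only the first line
    # that starts with "[[", which lacks the final closing bracket.
    match = re.search(r"\[\[.*", body)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0) + "]")
        return payload[0][2]
    except (ValueError, IndexError, KeyError, TypeError):
        return None


# Made-up body for illustration only: a short prefix, then the nested list.
sample_body = (
    ")]}'\n\n"
    '[["ecr",1,"<div>Great game</div>",0]\n'
    "]"
)

print(review_html(sample_body))  # <div>Great game</div>
```

The returned string is what would then be handed to Selector(text=...) for XPath queries.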