Scrapy Spider for JSON Response


Problem description

I am trying to write a spider which crawls through the following JSON response: http://gdata.youtube.com/feeds/api/standardfeeds/UK/most_popular?v=2&alt=json

How would the spider look if I wanted to crawl all the titles of the videos? None of my spiders work.

from scrapy.spider import BaseSpider
import json
from youtube.items import YoutubeItem
class MySpider(BaseSpider):
    name = "youtubecrawler"
    allowed_domains = ["gdata.youtube.com"]
    start_urls = ['http://www.gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json']

    def parse(self, response):
        items = []
        jsonresponse = json.loads(response)
        for video in jsonresponse["feed"]["entry"]:
            item = YoutubeItem()
            print jsonresponse
            print video["media$group"]["yt$videoid"]["$t"]
            print video["media$group"]["media$description"]["$t"]
            item["title"] = video["title"]["$t"]
            print video["author"][0]["name"]["$t"]
            print video["category"][1]["term"]
            items.append(item)
        return items

I always get the following error:

2014-01-05 16:55:21+0100 [youtubecrawler] ERROR: Spider error processing <GET http://gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json>
        Traceback (most recent call last):
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop
            self.runUntilCurrent()
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 382, in callback
            self._startRunCallbacks(result)
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 490, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/home/bxxxx/svn/ba_txxxxx/scrapy/youtube/spiders/test.py", line 15, in parse
            jsonresponse = json.loads(response)
          File "/usr/lib/python2.7/json/__init__.py", line 326, in loads
            return _default_decoder.decode(s)
          File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
            obj, end = self.raw_decode(s, idx=_w(s, 0).end())
        exceptions.TypeError: expected string or buffer

Recommended answer

Found two issues in your code:

  1. The start URL is not accessible; I took the www out of it.
  2. Changed json.loads(response) to json.loads(response.body_as_unicode()).
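The second fix is the one behind the traceback: json.loads expects a string, but the spider passes it the whole Response object, hence "expected string or buffer". A minimal sketch of the failure and the fix, using a made-up stand-in class rather than a real Scrapy Response:

```python
import json

class FakeResponse(object):
    """Stand-in for a Scrapy Response, for illustration only."""
    def __init__(self, body):
        self.body = body

    def body_as_unicode(self):
        # Scrapy's Response exposes the decoded body this way
        return self.body.decode("utf-8")

resp = FakeResponse(b'{"feed": {"entry": []}}')

# Passing the response object itself raises TypeError,
# matching the error in the traceback above.
try:
    json.loads(resp)
except TypeError:
    print("TypeError: json.loads needs a string, not a Response")

# Decoding the body to a unicode string first works:
data = json.loads(resp.body_as_unicode())
print(data["feed"]["entry"])  # []
```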

This works fine for me:

class MySpider(BaseSpider):
    name = "youtubecrawler"
    allowed_domains = ["gdata.youtube.com"]
    start_urls = ['http://gdata.youtube.com/feeds/api/standardfeeds/DE/most_popular?v=2&alt=json']

    def parse(self, response):
        items = []
        jsonresponse = json.loads(response.body_as_unicode())
        for video in jsonresponse["feed"]["entry"]:
            item = YoutubeItem()
            print video["media$group"]["yt$videoid"]["$t"]
            print video["media$group"]["media$description"]["$t"]
            item["title"] = video["title"]["$t"]
            print video["author"][0]["name"]["$t"]
            print video["category"][1]["term"]
            items.append(item)
        return items
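Once the body is decoded, extracting the titles is plain dictionary navigation over the GData feed shape (feed → entry → title → $t). A sketch against a made-up sample document that mimics that structure (the video data below is invented, not a real API response):

```python
import json

# Minimal sample mimicking the GData JSON feed shape (made-up data)
sample = json.loads('''
{"feed": {"entry": [
    {"title": {"$t": "First video"},
     "media$group": {"yt$videoid": {"$t": "abc123"}}},
    {"title": {"$t": "Second video"},
     "media$group": {"yt$videoid": {"$t": "def456"}}}
]}}
''')

# Same traversal the parse() method performs
titles = [video["title"]["$t"] for video in sample["feed"]["entry"]]
print(titles)  # ['First video', 'Second video']
```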

