How to scrape xml urls with scrapy

Problem description

Hi, I am working with Scrapy to scrape XML URLs.

Suppose below is my spider.py code:

from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):
    name = "test"
    # allowed_domains should be a list, not a set
    allowed_domains = ["www.example.com"]

    start_urls = [
        "https://example.com/jobxml.asp"
        ]

    def parse(self, response):
        print response, "??????????????????????"

Result:

2012-07-24 16:43:34+0530 [scrapy] INFO: Scrapy 0.14.3 started (bot: testproject)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled item pipelines: 
2012-07-24 16:43:34+0530 [test] INFO: Spider opened
2012-07-24 16:43:34+0530 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-24 16:43:36+0530 [testproject] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 1 times): 400 Bad Request
2012-07-24 16:43:37+0530 [test] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 2 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Gave up retrying <GET https://example.com/jobxml.asp> (failed 3 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Crawled (400) <GET https://example.com/jobxml.asp> (referer: None)
2012-07-24 16:43:38+0530 [test] INFO: Closing spider (finished)
2012-07-24 16:43:38+0530 [test] INFO: Dumping spider stats:
    {'downloader/request_bytes': 651,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 504,
     'downloader/response_count': 3,
     'downloader/response_status_count/400': 3,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 7, 24, 11, 13, 38, 573931),
     'scheduler/memory_enqueued': 3,
     'start_time': datetime.datetime(2012, 7, 24, 11, 13, 34, 803202)}
2012-07-24 16:43:38+0530 [test] INFO: Spider closed (finished)
2012-07-24 16:43:38+0530 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 263143424, 'memusage/startup': 263143424}

Does Scrapy not work for XML scraping? If so, can anyone please provide an example of how to scrape XML tag data?

Thanks in advance.

Solution

There is a spider made specifically for scraping XML feeds. This is from the Scrapy documentation:

XMLFeedSpider example

These spiders are pretty easy to use, let’s have a look at one example:

from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))

        item = TestItem()
        item['id'] = node.select('@id').extract()
        item['name'] = node.select('name').extract()
        item['description'] = node.select('description').extract()
        return item
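
The TestItem imported from myproject.items is not shown in the docs excerpt; a minimal definition consistent with parse_node might look like this (the three field names are assumed from the example above):

from scrapy.item import Item, Field

class TestItem(Item):
    # field names assumed from the parse_node example above
    id = Field()
    name = Field()
    description = Field()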

This is another way, without Scrapy:

This is a function used to download the XML from a given URL, and it will also show you a nice progress indicator while the file downloads.

import sys
import urllib2

def get_file(dir, url, name):
    s = urllib2.urlopen(url)
    # NOTE: the output path is hardcoded; dir and name are not used for it
    f = open('xml/test.xml', 'wb')
    meta = s.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (name, file_size)
    current_file_size = 0
    block_size = 4096
    while True:
        buf = s.read(block_size)
        if not buf:
            break
        current_file_size += len(buf)
        f.write(buf)
        # print download progress, overwriting the previous status line
        status = ("\r%10d  [%3.2f%%]" %
                 (current_file_size, current_file_size * 100. / file_size))
        status = status + chr(8)*(len(status)+1)
        sys.stdout.write(status)
        sys.stdout.flush()
    f.close()
    print "\nDone getting feed"
    return 1
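
A hypothetical call, using the feed URL from the question (the xml/ directory must already exist, since the output path is hardcoded):

# hypothetical usage; 'xml/' must exist because the function writes to xml/test.xml
get_file('xml', 'https://example.com/jobxml.asp', 'jobxml.asp')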

Then you parse the XML file that you downloaded and saved, with iterparse, something like:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.text

That's just an example of how you walk through the XML tree.
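
For a very large feed, a common refinement is to clear each element once you are done with it, so the whole tree is never held in memory; a minimal sketch, using the same file and tag as above:

from xml.etree.ElementTree import iterparse

# iterparse fires 'end' events by default, so each element is complete here
for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.text
    # discard the element's contents to keep memory usage flat
    elem.clear()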

Also, this is old code of mine, so you would be better off using with to open files.
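
A minimal sketch of the same download loop rewritten with a context manager, so the file is closed even if the download fails partway (same hardcoded output path as above):

import urllib2

s = urllib2.urlopen('https://example.com/jobxml.asp')
with open('xml/test.xml', 'wb') as f:
    while True:
        buf = s.read(4096)
        if not buf:
            break
        f.write(buf)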
