How to scrape XML URLs with Scrapy
Question
Hi, I am using Scrapy to scrape XML URLs.
Suppose this is my spider.py code:
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):
    name = "test"
    allowed_domains = {"www.example.com"}
    start_urls = [
        "https://example.com/jobxml.asp"
    ]

    def parse(self, response):
        print response,"??????????????????????"
Result:
2012-07-24 16:43:34+0530 [scrapy] INFO: Scrapy 0.14.3 started (bot: testproject)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled item pipelines:
2012-07-24 16:43:34+0530 [test] INFO: Spider opened
2012-07-24 16:43:34+0530 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-24 16:43:36+0530 [testproject] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 1 times): 400 Bad Request
2012-07-24 16:43:37+0530 [test] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 2 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Gave up retrying <GET https://example.com/jobxml.asp> (failed 3 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Crawled (400) <GET https://example.com/jobxml.asp> (referer: None)
2012-07-24 16:43:38+0530 [test] INFO: Closing spider (finished)
2012-07-24 16:43:38+0530 [test] INFO: Dumping spider stats:
{'downloader/request_bytes': 651,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 504,
'downloader/response_count': 3,
'downloader/response_status_count/400': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 24, 11, 13, 38, 573931),
'scheduler/memory_enqueued': 3,
'start_time': datetime.datetime(2012, 7, 24, 11, 13, 34, 803202)}
2012-07-24 16:43:38+0530 [test] INFO: Spider closed (finished)
2012-07-24 16:43:38+0530 [scrapy] INFO: Dumping global stats:
{'memusage/max': 263143424, 'memusage/startup': 263143424}
Does Scrapy not work for XML scraping? If it does work, can anyone please provide an example of how to scrape XML tag data?
Thanks in advance.
Recommended answer
Scrapy has a spider made specifically for scraping XML feeds. This is from the Scrapy documentation:
XMLFeedSpider example
These spiders are pretty easy to use; let's have a look at one example:
from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))
        item = TestItem()  # use the item class imported above
        item['id'] = node.select('@id').extract()
        item['name'] = node.select('name').extract()
        item['description'] = node.select('description').extract()
        return item
Here is another way, without Scrapy:
This is a function that downloads XML from a given URL. Note that the imports it needs (urllib2 and sys) are not shown here. It also prints a nice progress indicator while the XML file downloads.
def get_file(self, dir, url, name):
    # Note: the output path is hardcoded and the dir argument is unused.
    s = urllib2.urlopen(url)
    f = open('xml/test.xml', 'wb')  # binary mode, so the bytes are written unchanged
    meta = s.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (name, file_size)
    current_file_size = 0
    block_size = 4096
    while True:
        buf = s.read(block_size)
        if not buf:
            break
        current_file_size += len(buf)
        f.write(buf)
        # Build the status line, then append backspaces so the next one overwrites it.
        status = ("\r%10d [%3.2f%%]" %
                  (current_file_size, current_file_size * 100. / file_size))
        status = status + chr(8)*(len(status)+1)
        sys.stdout.write(status)
        sys.stdout.flush()
    f.close()
    print "\nDone getting feed"
    return 1
You can then parse the XML file that you downloaded and saved using iterparse, something like:
from xml.etree.ElementTree import iterparse

for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.text
That's just an example of how to walk through the XML tree.
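As a rough, self-contained illustration of the iterparse approach, here is a minimal sketch in Python 3 (the answer's own snippets are Python 2). The sample feed, its tag names, and the helper name collect_text are assumptions made up for this example; they are not part of the original answer.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample feed; the tag names are assumptions for illustration only.
SAMPLE_XML = b"""<feed>
  <item id="1"><properties>first</properties></item>
  <item id="2"><properties>second</properties></item>
</feed>"""


def collect_text(source, tag):
    """Return the text of every <tag> element found in source.

    source may be a file name or a binary file-like object.
    """
    texts = []
    # By default iterparse yields ("end", element) once each element is complete.
    for event, elem in iterparse(source):
        if elem.tag == tag:
            texts.append(elem.text)
            elem.clear()  # release the content of elements we are done with
    return texts


if __name__ == "__main__":
    print(collect_text(io.BytesIO(SAMPLE_XML), "properties"))
```

Passing a file name such as 'xml/test.xml' instead of the in-memory buffer should work the same way, since iterparse accepts either.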
Also, this is old code of mine, so you would be better off using a with statement for opening files.
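The with suggestion could look roughly like the sketch below, written in Python 3 rather than the Python 2 of the original function. The helper name copy_with_progress and its parameters are hypothetical, and in-memory buffers stand in for the network response and the output file so the example runs without a connection.

```python
import io
import sys


def copy_with_progress(src, dst, total_size, block_size=4096, out=sys.stdout):
    """Copy src to dst in blocks, writing a one-line progress status to out.

    src and dst are binary file-like objects; total_size is the expected
    number of bytes (for a download, the Content-Length header value).
    """
    copied = 0
    while True:
        buf = src.read(block_size)
        if not buf:
            break
        dst.write(buf)
        copied += len(buf)
        if total_size:
            # "\r" returns to the start of the line so the status overwrites itself.
            out.write("\r%10d [%6.2f%%]" % (copied, copied * 100.0 / total_size))
            out.flush()
    return copied


if __name__ == "__main__":
    payload = b"<feed>" + b"x" * 10000 + b"</feed>"
    # The with statement closes both streams even if the copy raises.
    with io.BytesIO(payload) as src, io.BytesIO() as dst:
        count = copy_with_progress(src, dst, len(payload))
    print("\nDone getting feed (%d bytes)" % count)
```

For a real download, src would presumably be the response returned by urllib.request.urlopen(url) and dst a file opened with open(path, "wb"); both can be used in a with statement in Python 3.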