Scrapy Tutorial Example
Question
I'm hoping someone can point me in the right direction with using Scrapy in Python.
I've been trying to follow the example for several days and still can't get the expected output. I used the Scrapy tutorial, http://doc.scrapy.org/en/latest/intro/tutorial.html#defining-our-item, and even downloaded the exact project from the GitHub repo, but the output I get is not what the tutorial describes.
from scrapy.spiders import Spider
from scrapy.selector import Selector
from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html
        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)
        return items
After downloading the project from GitHub, I ran "scrapy crawl dmoz" in the top-level directory and got the following output:
2016-08-31 00:08:19 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2016-08-31 00:08:19 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'NEWSPIDER_MODULE': 'dirbot.spiders', 'SPIDER_MODULES': ['dirbot.spiders']}
2016-08-31 00:08:19 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-08-31 00:08:19 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-31 00:08:19 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-31 00:08:19 [scrapy] INFO: Enabled item pipelines:
['dirbot.pipelines.FilterWordsPipeline']
2016-08-31 00:08:19 [scrapy] INFO: Spider opened
2016-08-31 00:08:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-31 00:08:19 [scrapy] DEBUG: Telnet console listening on 128.1.2.1:2700
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-31 00:08:20 [scrapy] INFO: Closing spider (finished)
2016-08-31 00:08:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 514,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16179,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 31, 7, 8, 20, 314625),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 8, 31, 7, 8, 19, 882944)}
2016-08-31 00:08:20 [scrapy] INFO: Spider closed (finished)
According to the tutorial, I expected this:
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
'link': [u'http://gnosis.cx/TPiP/'],
'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
'title': [u'XML Processing with Python']}
Recommended answer
The spider in the tutorial appears to be outdated. The website's markup has changed, so none of the XPaths match anything anymore. This is easy to fix:
def parse(self, response):
    sites = response.xpath('//div[@class="title-and-desc"]/a')
    for site in sites:
        item = dict()
        item['name'] = site.xpath("text()").extract_first()
        item['url'] = site.xpath("@href").extract_first()
        item['description'] = site.xpath("following-sibling::div/text()").extract_first('').strip()
        yield item
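To see why the original crawl fetched both pages with status 200 yet scraped nothing, it can help to run the old and new expressions against a small fragment offline. The sketch below is only an illustration under stated assumptions: SAMPLE is a hypothetical, simplified stand-in for the updated page markup (not copied from dmoz.org), and the standard library's xml.etree.ElementTree is swapped in for Scrapy's selectors so the snippet has no dependencies. ElementTree supports only a subset of XPath and has no following-sibling axis, so the description is read from the div inside the same parent block instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the updated markup (an assumption,
# not the real dmoz.org page).
SAMPLE = """
<html><body>
  <div class="title-and-desc">
    <a href="http://example.com/book">Example Book</a>
    <div>  A sample description.  </div>
  </div>
</body></html>
"""

# The tutorial's expression and the answer's expression, written in the
# XPath subset that ElementTree understands.
OLD_XPATH = ".//ul[@class='directory-url']/li"
NEW_XPATH = ".//div[@class='title-and-desc']/a"


def extract_items(root):
    """Build one dict per listing block, mirroring the fields in parse()."""
    items = []
    for block in root.findall(".//div[@class='title-and-desc']"):
        link = block.find("a")
        desc = block.find("div")
        if link is None:
            continue
        items.append({
            "name": (link.text or "").strip(),
            "url": link.get("href"),
            "description": (desc.text or "").strip() if desc is not None else "",
        })
    return items


root = ET.fromstring(SAMPLE)
old_matches = root.findall(OLD_XPATH)  # nothing matches, so a loop over it never runs
new_matches = root.findall(NEW_XPATH)  # matches the link in the sample block
items = extract_items(root)

print("old XPath matches:", len(old_matches))
print("new XPath matches:", len(new_matches))
print(items)
```

With the old expression the loop body never executes, which is consistent with a crawl log that shows successful responses but no "Scraped from" lines.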
For future reference, you can always test whether a specific XPath works with the scrapy shell command. For example, this is what I did to test it:
$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
# test sites xpath
response.xpath('//ul[@class="directory-url"]/li')
[]
# ok it doesn't work, check out page in web browser
view(response)
# find correct xpath and test that:
response.xpath('//div[@class="title-and-desc"]/a')
# 21 result nodes printed
# it works!
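The same check can also be made part of the spider, so that a layout change shows up in the log instead of silently producing zero items. This is a rough sketch, not part of Scrapy: check_matches and the logger name are hypothetical helpers introduced here for illustration. It relies only on the fact that an empty result list is falsy, which holds for plain lists and should also hold for the list-like object that response.xpath() returns.

```python
import logging

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("selector_check")


def check_matches(matches, xpath, url):
    """Return True if the selector matched something; otherwise log a warning.

    `matches` is any list-like result, for example what response.xpath(xpath)
    returns inside a spider's parse() method.
    """
    if not matches:
        logger.warning(
            "XPath %r matched nothing on %s; the page layout may have changed",
            xpath,
            url,
        )
        return False
    return True


# Usage with plain lists standing in for selector results.
empty_ok = check_matches([], '//ul[@class="directory-url"]/li', "http://example.com/")
found_ok = check_matches(["<a>"], '//div[@class="title-and-desc"]/a', "http://example.com/")
print(empty_ok, found_ok)
```

Inside parse(), this could be called right after the response.xpath(...) line, passing response.url as the last argument.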