Scrapy XPath all the links on the page

Question

I am trying to collect all the URLs under a domain using Scrapy. I am using CrawlSpider to start from the homepage and crawl the whole site. For each page, I want to use XPath to extract all the hrefs and store the data in a key-value format:

Key: the current URL. Value: all the links on this page.

class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']

    rules = (Rule(SgmlLinkExtractor()), )
    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = AbcItem()
        item['key'] = response.url 
        item['value'] = hxs.select('//a/@href').extract()
        return item 

I defined my AbcItem() as below:

from scrapy.item import Item, Field

class AbcItem(Item):

    # key: url
    # value: list of links existing in the key url
    key = Field()
    value = Field()
    pass

And when I run my code like this:

nohup scrapy crawl abc.com -o output -t csv &

The robot seems to have begun crawling, and I can see the nohup.out file being populated with all the configuration logs, but there is no information in my output file, which is what I am trying to collect. Can anyone help me with this? What might be wrong with my robot?

Answer

You should have defined a callback for the rule; without a callback, CrawlSpider only follows the extracted links and never calls your parse_item, so no items are ever written to the output file. Here's an example for getting all links from the twitter.com main page (follow=False):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url= Field()


class MySpider(CrawlSpider):
    name = 'twitter.com'
    allowed_domains = ['twitter.com']
    start_urls = ['http://www.twitter.com']

    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=False), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url
        return item

Then, in the output file, I see:

http://status.twitter.com/
https://twitter.com/
http://support.twitter.com/forums/26810/entries/78525
http://support.twitter.com/articles/14226-how-to-find-your-twitter-short-code-or-long-code
...

Hope that helps.
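
For the original key/value goal, the same fix can be combined with the question's XPath: give the rule a callback and let it follow links across the domain. The snippet below is a minimal, untested sketch that reuses the names from the question (MySpider, AbcItem, abc.com) and the same old-style Scrapy APIs as the answer (SgmlLinkExtractor, HtmlXPathSelector):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class AbcItem(Item):
    key = Field()    # URL of the crawled page
    value = Field()  # list of hrefs found on that page


class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']

    # follow=True keeps the crawl going across the whole domain;
    # the callback must not be named 'parse', which CrawlSpider uses internally
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True), )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = AbcItem()
        item['key'] = response.url
        item['value'] = hxs.select('//a/@href').extract()
        return item

Note that on recent Scrapy versions the scrapy.contrib imports no longer exist; LinkExtractor from scrapy.linkextractors and response.xpath('//a/@href').extract() are the modern equivalents.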
