How to properly use Rules, restrict_xpaths to crawl and parse URLs with scrapy?

Problem description

I am trying to program a crawl spider to crawl the RSS feeds of a website and then parse the meta tags of each article.

The first RSS page is a page that displays the RSS categories. I managed to extract the links because each <a> tag sits inside a <td> tag. It looks like this:

        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject1">subject1</a>
           </td>   
        </tr>
        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject2">subject2</a>
           </td>
        </tr>
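
For reference, the category links above can be checked with a plain XPath before any rules are involved. A minimal sketch, assuming the same deprecated HtmlXPathSelector API used in the question (in current Scrapy this would be response.xpath); the helper name is hypothetical:

from scrapy.selector import HtmlXPathSelector

def extract_category_links(response):
    # Pull the href of every <a> inside a <td class="xmlLink"> cell
    hxs = HtmlXPathSelector(response)
    return hxs.select('//td[@class="xmlLink"]/a/@href').extract()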

Once you click that link, it brings you to the articles for that RSS category, which look like this:

   <li class="regularitem">
    <h4 class="itemtitle">
        <a href="http://example.com/article1">article1</a>
    </h4>
  </li>
  <li class="regularitem">
     <h4 class="itemtitle">
        <a href="http://example.com/article2">article2</a>
     </h4>
  </li>

As you can see, I can again get the link with XPath if I use the <h4> tag. I want my crawler to go to the link inside that tag and parse the meta tags for me.
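
The companion check for the article links, under the same assumptions (deprecated HtmlXPathSelector, hypothetical helper name):

def extract_article_links(response):
    # Pull the href of every <a> inside an <h4 class="itemtitle"> heading
    hxs = HtmlXPathSelector(response)
    return hxs.select('//h4[@class="itemtitle"]/a/@href').extract()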

Here is my crawler code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
             Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles')]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items

However, this is the output when I run the crawler:

DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)

What am I doing wrong here? I've been reading the documentation over and over again, but I feel like I keep overlooking something. Any help would be appreciated.

Edit: added items.append(item); I had forgotten it in the original post. I've also tried the following, and it resulted in the same output:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss']  # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')), follow=True)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_link)

    def parse_link(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback=self.parse_again)

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items

Recommended answer

You've returned an empty items list; you need to append each item to items. You can also yield item inside the loop.
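
A minimal sketch of the corrected callback using the yield-in-the-loop variant, assuming the same exampleItem and the deprecated HtmlXPathSelector API from the question:

def parse_articles(self, response):
    hxs = HtmlXPathSelector(response)
    for m in hxs.select('//meta'):
        item = exampleItem()
        item['link'] = response.url
        item['meta_name'] = m.select('@name').extract()
        item['meta_value'] = m.select('@content').extract()
        # Yielding inside the loop means there is no item list to forget to fill
        yield item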
