Scrapy Linkextractor duplicating(?)

Question

My spider is implemented as below.

It is working and it goes through the sites allowed by the link extractor.

Basically what I am trying to do is to extract information from different places on the page:

- the href and text() under the 'news' class, if present
- the image URL under the 'think block' class, if present

I have three problems with my spider:

1) Duplicating link extractor

It seems to process the same pages more than once. (I checked the export file and found the same ~.img appearing many times, which is hardly possible.)

In fact, every page of the site has hyperlinks at the bottom that let users jump directly to the topics they are interested in, while my objective is to extract information from the topic pages (each lists the titles of several articles under the same topic) and the images found within each article page (you reach an article page by clicking an article title on a topic page).

I suspect the link extractor loops over the same pages again in this case.

(Maybe solve it with depth_limit?)

2) Improving parse_item

I think parse_item is quite inefficient. How could I improve it? I need to extract information from different places on the page (and of course only extract it when it exists). Besides, it looks like parse_item only produces HkejImage items and never HkejItem items (again, I checked the output file). How should I tackle this?

3) I need the spider to be able to read Chinese.

I am crawling a Hong Kong site, so being able to handle Chinese text is essential.

The site:

http://www1.hkej.com/dailynews/headline/article/1105148/IMF%E5%82%B3%E4%BF%83%E4%B8%AD%E5%9C%8B%E9%80%80%E5%87%BA%E6%95%91%E5%B8%82

As long as it belongs to 'dailynews', it's something I want.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
import items


class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    allowed_domains = ["hkej.com"]
    login_page = 'http://www.hkej.com/template/registration/jsp/login.jsp'
    start_urls =  'http://www.hkej.com/dailynews'

    rules = (
        Rule(LinkExtractor(allow=('dailynews', ), unique=True), callback='parse_item', follow=True),
    )


    def start_requests(self):
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # name column
    def login(self, response):
        return FormRequest.from_response(response,
                    formdata={'name': 'users', 'password': 'my password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "username" in response.body:       
            self.log("


Successfully logged in. Let's start crawling!


")
            return Request(url=self.start_urls)
        else:
            self.log("


You are not logged in.


")
            # Something went wrong, we couldn't log in, so nothing happens

    def parse_item(self, response):
        hxs = Selector(response)
        news=hxs.xpath("//div[@class='news']")
        images=hxs.xpath('//p')

        for image in images:
            allimages=items.HKejImage()
            allimages['image'] = image.xpath('a/img[not(@data-original)]/@src').extract()
            yield allimages

        for new in news:
            allnews = items.HKejItem()
            allnews['news_title']=new.xpath('h2/@text()').extract()
            allnews['news_url'] = new.xpath('h2/@href').extract()
            yield allnews

Thank you very much and I would appreciate any help!

Answer

First, to set settings, do it in the settings.py file, or you can specify the custom_settings attribute on the spider, like:

custom_settings = {
    'DEPTH_LIMIT': 3,
}
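For reference, a minimal sketch of where that attribute sits (only the relevant lines of the spider are shown; the rest stays as in your code):

class EconjournalSpider(CrawlSpider):
    name = "econJournal"
    # per-spider override of the project settings; caps how many link levels deep the crawl goes
    custom_settings = {
        'DEPTH_LIMIT': 3,
    }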

Then you have to make sure the spider is reaching the parse_item method (which I don't think it is; I haven't tested it yet). Also, you can't specify both the callback and follow parameters on a rule, because they don't work together.

First remove the follow on your rule, or add another rule to control which links to follow and which links to return as items, as sketched below.
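A rough sketch of that two-rule setup (the 'article' allow pattern here is just a placeholder; adjust both patterns to the site's real topic and article URLs):

rules = (
    # follow topic listing pages, but don't parse them as items
    Rule(LinkExtractor(allow=('dailynews', ))),
    # parse article pages as items; this pattern is an assumption
    Rule(LinkExtractor(allow=('article', )), callback='parse_item'),
)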

Second, in your parse_item method you are using incorrect XPath. To get all the images, maybe you could use something like:

images=hxs.xpath('//img')

And then to get the image URLs:

allimages['image'] = image.xpath('./@src').extract()

For the news, it looks like this could work:

allnews['news_title']=new.xpath('.//a/text()').extract()
allnews['news_url'] = new.xpath('.//a/@href').extract()
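Putting those pieces together, a revised parse_item could look roughly like this (only a sketch; it assumes the HKejImage and HKejItem classes from your items module and the corrected XPaths above):

def parse_item(self, response):
    hxs = Selector(response)

    # every <img> on the page; narrow this down if you only want 'think block' images
    for image in hxs.xpath('//img'):
        allimages = items.HKejImage()
        allimages['image'] = image.xpath('./@src').extract()
        yield allimages

    # each news block: take the text and href of the anchor inside it
    for new in hxs.xpath("//div[@class='news']"):
        allnews = items.HKejItem()
        allnews['news_title'] = new.xpath('.//a/text()').extract()
        allnews['news_url'] = new.xpath('.//a/@href').extract()
        yield allnews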

Now, as I understand your problem, this isn't a LinkExtractor duplicating error, but just poor rule specifications. Also make sure you use valid XPath expressions, because your question didn't indicate that the XPaths needed correcting.
