Scrapy - How to extract all blog posts from a category?

Question

I am using Scrapy to extract all the posts from my blog. The problem is that I cannot figure out how to create a rule that reads all the posts in any given blog category.

Example: on my blog, the category "Environment setup" has 17 posts, so in the Scrapy code I can hard-code the page URLs as shown below, but this is not a very practical approach:

start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
              "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
              "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]

I have read similar posts related to this question here on SO, like 1, 2, 3, 4, 5, 6, 7, but I can't seem to find the answer in any of them. As you can see, the only difference is the page count in the above URLs. How can I write a rule in Scrapy that can read all the blog posts in a category? And another trivial question: how can I configure the spider to crawl my blog so that when I publish a new blog post, the crawler can immediately detect it and write it to a file?

This is what I have so far for the spider class:

from BlogScraper.items import BlogscraperItem
from scrapy.spiders import CrawlSpider,Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request


class MySpider(CrawlSpider):
    name = "nextpage" # give your spider a unique name because it will be used for crawling the webpages

    #allowed domain restricts the spider crawling
    allowed_domains=["https://edumine.wordpress.com/"]
    # in start_urls you have to specify the urls to crawl from
    start_urls=["https://edumine.wordpress.com/category/ide-configuration/environment-setup/"]

    '''
    start_urls=["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
                "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
                "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]


    rules = [
                Rule(SgmlLinkExtractor
                    (allow=("https://edumine.wordpress.com/category/ide-configuration/environment-setup/\d"),unique=False,follow=True))
            ]
'''
    rules= Rule(LinkExtractor(allow='https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/'),follow=True,callback='parse_page')

    def parse_page(self, response):

        hxs=Selector(response)
        titles = hxs.xpath("//h1[@class='entry-title']")
        items = []
        with open("itemLog.csv","w") as f:
             for title in titles:
                item = BlogscraperItem()
                item["post_title"] = title.xpath("//h1[@class='entry-title']//text()").extract()
                item["post_time"] = title.xpath("//time[@class='entry-date']//text()").extract()
                item["text"]=title.xpath("//p//text()").extract()
                item["link"] = title.select("a/@href").extract()

                items.append(item)

                f.write('post title: {0}\n, post_time: {1}\n, post_text: {2}\n'.format(item['post_title'], item['post_time'],item['text']))
                print "#### \tTotal number of posts= ",len(items), " in category####"


        f.close()

Any help or suggestions to solve this?

Answer

There are a few things you can improve in your code, and you want to solve two problems: reading the posts and crawling automatically.

If you want to get the contents of new blog posts, you have to re-run your spider; otherwise you would end up with an endless loop. Naturally, in this case you have to make sure that you do not scrape entries you have already scraped (using a database, reading the already-written files at spider start, and so on). But you cannot have a spider that runs forever and waits for new entries; that is not what a spider is for.
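As a minimal sketch of the "do not scrape already scraped entries" idea, the spider below remembers which URLs it has handled between runs (the IncrementalSpider name and the seen_urls.txt bookkeeping file are assumptions for illustration, not part of the original code):

import os
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class IncrementalSpider(CrawlSpider):
    # hypothetical spider that skips pages already processed on an earlier run
    name = "incremental"
    allowed_domains = ["edumine.wordpress.com"]
    start_urls = ["https://edumine.wordpress.com/category/ide-configuration/environment-setup/"]
    rules = [Rule(LinkExtractor(allow='page/'), follow=True, callback='parse_page')]

    seen_file = "seen_urls.txt"  # assumed bookkeeping file, one URL per line

    def __init__(self, *args, **kwargs):
        super(IncrementalSpider, self).__init__(*args, **kwargs)
        self.seen = set()
        if os.path.exists(self.seen_file):
            with open(self.seen_file) as f:
                self.seen = set(line.strip() for line in f)

    def parse_page(self, response):
        if response.url in self.seen:
            return  # already handled on a previous run
        with open(self.seen_file, "a") as f:
            f.write(response.url + "\n")
        # ... extract and yield the items here, as in the original parse_page ...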

Your approach of storing the posts in a file is wrong. Why scrape a list of items and then do nothing with it? And why save the items inside the parse_page function? This is what item pipelines are for: you should write one and do the exporting there. Also, the f.close() is not necessary, because you use the with statement, which closes the file for you at the end.
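For example, a minimal item pipeline that exports every scraped item to a CSV file could look like the sketch below (the CsvExportPipeline class name and the itemLog.csv filename are assumptions; CsvItemExporter is Scrapy's stock CSV exporter):

from scrapy.exporters import CsvItemExporter


class CsvExportPipeline(object):
    # sketch: open the output file when the spider starts, export each item, close at the end

    def open_spider(self, spider):
        self.file = open("itemLog.csv", "wb")
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

To activate it, the pipeline would be registered in settings.py with something like ITEM_PIPELINES = {'BlogScraper.pipelines.CsvExportPipeline': 300}.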

Your rules variable should throw an error because it is not iterable. I wonder if you even tested your code. The rule is also more complex than it needs to be. You can simplify it to this:

rules = [Rule(LinkExtractor(allow='page/*'), follow=True, callback='parse_page'),]

It follows every URL that has /page in it.

If you start your scraper, you will see that the results are filtered because of your allowed domains:

Filtered offsite request to 'edumine.wordpress.com': <GET https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/>

To solve this, change your domain to:

allowed_domains = ["edumine.wordpress.com"]

If you want to crawl other WordPress sites as well, simply change it to:

allowed_domains = ["wordpress.com"]
