Scraping project euler site with scrapy


Question


I'm trying to scrape projecteuler.net with Python's Scrapy library, just to get some practice with it. I've seen more than one existing implementation of such a scraper online, but they seem too elaborate for me. I simply want to save the problems (title, id, content) as JSON and then load them with AJAX into a local web page on my PC.

I'm working on my own solution, which I will finish in any case, but since I want to discover smarter ways to use the library, I'm asking you to propose the cleanest Scrapy programs for doing this job (if you prefer to skip the JSON step and save directly to HTML, that may be even better for me).

This is my first approach (doesn't work):

# -*- coding: utf-8 -*-
import httplib2
import requests
import scrapy
from eulerscraper.items import Problem
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule


def start_urls_detection():
    # su = ['https://projecteuler.net/archives', 'https://projecteuler.net/archives;page=2']
    # i = 1
    #
    # while True:
    #     request = requests.get(su[i])
    #
    #     if request.status_code != 200:
    #         break
    #
    #     i += 1
    #     su.append('https://projecteuler.net/archives;page=' + str(i + 1))

    return ["https://projecteuler.net/"]


class EulerSpider(CrawlSpider):
    name = 'euler'
    allowed_domains = ['projecteuler.net']
    start_urls = start_urls_detection()

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),
        Rule(LinkExtractor(allow=('problem=\d*',)), callback="parse_problems"),
        Rule(LinkExtractor(allow=('archives;page=\d*',), unique=True), follow=True)
    )

    def start_requests(self):
        # su = ['https://projecteuler.net/archives', 'https://projecteuler.net/archives;page=2']
        # i = 1
        #
        # while True:
        #     request = requests.get(su[i])
        #
        #     if request.status_code != 200:
        #         break
        #
        #     i += 1
        #     su.append('https://projecteuler.net/archives;page=' + str(i + 1))

        return [scrapy.Request("https://projecteuler.net/archives", self.parse)]

    def parse_problems(self, response):
        l = ItemLoader(item=Problem(), response=response)
        l.add_css("title", "h2")
        l.add_css("id", "#problem_info")
        l.add_css("content", ".problem_content")

        yield l.load_item()

    # def parse_content(self, response):
    #     #return response.css("div.problem_content::text").extract()
    #     next_page = "https://projecteuler.net/archives;page=2"
    #     n = 3
    #
    #     while n < 14:
    #         next_page = response.urljoin(next_page)
    #         yield scrapy.Request(next_page, callback=self.parse)
    #         next_page = next_page[0:len(next_page) - 1] + str(n)
    #         n += 1

For now I will try some LinkExtractor + manual requests combination. In the meantime, I'll wait for your solutions...
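
A minimal sketch of what such a LinkExtractor + manual requests combination could look like (untested; the spider name, the LinkExtractor pattern and the empty parse_problems body are hypothetical placeholders):

import scrapy
from scrapy.linkextractors import LinkExtractor


class EulerComboSpider(scrapy.Spider):
    name = 'euler_combo'
    allowed_domains = ['projecteuler.net']
    # only the first archive page, for brevity
    start_urls = ['https://projecteuler.net/archives']

    # LinkExtractor is used only to find the problem links on the page;
    # the requests themselves are scheduled manually.
    problem_links = LinkExtractor(allow=(r'problem=\d+',))

    def parse(self, response):
        for link in self.problem_links.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_problems)

    def parse_problems(self, response):
        # extraction logic would go here (see the solution below)
        pass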

Solution

I think I have found the simplest yet fitting solution (at least for my purposes), compared to the existing code written to scrape projecteuler:

# -*- coding: utf-8 -*-
import scrapy
from eulerscraper.items import Problem
from scrapy.loader import ItemLoader


class EulerSpider(scrapy.Spider):
    name = 'euler'
    allowed_domains = ['projecteuler.net']
    start_urls = ["https://projecteuler.net/archives"]

    def parse(self, response):
        numpag = response.css("div.pagination a[href]::text").extract()
        maxpag = int(numpag[len(numpag) - 1])

        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

        for i in range(2, maxpag + 1):
            next_page = "https://projecteuler.net/archives;page=" + str(i)
            yield response.follow(next_page, self.parse_next)

    def parse_next(self, response):
        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

    def parse_problems(self, response):
        l = ItemLoader(item=Problem(), response=response)
        l.add_css("title", "h2")
        l.add_css("id", "#problem_info")
        l.add_css("content", ".problem_content")

        yield l.load_item()

From the start page (the archives) I follow every single link to a problem, scraping the data I need with parse_problems. Then I launch the same procedure on the other archive pages of the site, handling each list of links in the same way. The Item definition, with its pre- and post-processing, is also very clean:

import re

import scrapy
from scrapy.loader.processors import MapCompose, Compose
from w3lib.html import remove_tags


def extract_first_number(text):
    i = re.search('\d+', text)
    return int(text[i.start():i.end()])


def array_to_value(element):
    return element[0]


class Problem(scrapy.Item):
    id = scrapy.Field(
        input_processor=MapCompose(remove_tags, extract_first_number),
        output_processor=Compose(array_to_value)
    )
    title = scrapy.Field(input_processor=MapCompose(remove_tags))
    content = scrapy.Field()
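
As a quick standalone check of the two processors (a minimal sketch; the HTML snippet and the "Problem 7" text are assumptions about the page markup, and the import assumes the helpers live in eulerscraper/items.py next to Problem):

from w3lib.html import remove_tags

from eulerscraper.items import array_to_value, extract_first_number

# hypothetical markup; the real #problem_info element may differ
raw_id = '<span id="problem_info">Problem 7</span>'
print(extract_first_number(remove_tags(raw_id)))  # -> 7
print(array_to_value([7]))                        # -> 7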

I launch this with the command scrapy crawl euler -o euler.json and it outputs an array of unordered JSON objects, each corresponding to a single problem: this is fine for me because I'm going to process it with JavaScript, even though I think resolving the ordering problem via Scrapy can be very simple.

EDIT: in fact it is simple, using this pipeline:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.list_items = []
        self.file = open('euler.json', 'w')

    def close_spider(self, spider):
        # Reorder the collected items by problem id, then write valid JSON
        # (joining with ",\n" avoids a trailing comma before the closing "]").
        ordered_list = [None for i in range(len(self.list_items))]

        for i in self.list_items:
            ordered_list[int(i['id']) - 1] = json.dumps(dict(i))

        self.file.write("[\n" + ",\n".join(ordered_list) + "\n]\n")
        self.file.close()

    def process_item(self, item, spider):
        self.list_items.append(item)
        return item
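
For this pipeline to run, it also has to be enabled in the project settings; a minimal sketch, assuming the class lives in eulerscraper/pipelines.py (the module path and the priority value 300 are assumptions):

# settings.py (module path and priority are assumptions)
ITEM_PIPELINES = {
    'eulerscraper.pipelines.JsonWriterPipeline': 300,
}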

though the best solution may be to create a custom exporter:

from scrapy.exporters import JsonItemExporter
from scrapy.utils.python import to_bytes


class OrderedJsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        # Initialize via JsonItemExporter's constructor, forwarding any
        # feed-export options that were passed in
        super().__init__(file, **kwargs)
        self.list_items = []

    def export_item(self, item):
        self.list_items.append(item)

    def finish_exporting(self):
        ordered_list = [None for i in range(len(self.list_items))]

        for i in self.list_items:
            ordered_list[int(i['id'] - 1)] = i

        for i in ordered_list:
            if self.first_item:
                self.first_item = False
            else:
                self.file.write(b',')
                self._beautify_newline()
            itemdict = dict(self._get_serialized_fields(i))
            data = self.encoder.encode(itemdict)
            self.file.write(to_bytes(data, self.encoding))

        self._beautify_newline()
        self.file.write(b"]")

and configure it in the settings so that it is used for the json feed format:

FEED_EXPORTERS = {
    'json': 'eulerscraper.exporters.OrderedJsonItemExporter',
}
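
With this in place, the same scrapy crawl euler -o euler.json command should write the problems ordered by id, since the exporter arranges the collected items by id before serializing them; unlike the pipeline above, it hooks into Scrapy's feed exports, so the ordering applies to whatever output file is passed with -o.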
