Issue using Scrapy Spider output in a Python script


Question


I want to use the output from a spider inside a Python script. To accomplish this, I wrote the following code based on another thread.

The issue I'm facing is that the function spider_results() only returns a list of the last item repeated over and over, instead of a list of all the items that were found. When I run the same spider manually with the scrapy crawl command, I get the desired output. The script output, the JSON output from the manual run, and the spider itself are below.

What's wrong with my code?

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from circus.spiders.circus import MySpider

from scrapy.signalmanager import dispatcher


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)


    dispatcher.connect(crawler_results, signal=signals.item_passed)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())

Script output:

{'away_odds': 1.44,
 'away_team': 'Los Angeles Dodgers',
 'event_time': datetime.datetime(2019, 6, 8, 2, 15),
 'home_odds': 2.85,
 'home_team': 'San Francisco Giants',
 'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
 'league': 'MLB'}, {'away_odds': 1.44,
 'away_team': 'Los Angeles Dodgers',
 'event_time': datetime.datetime(2019, 6, 8, 2, 15),
 'home_odds': 2.85,
 'home_team': 'San Francisco Giants',
 'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
 'league': 'MLB'}, {'away_odds': 1.44,
 'away_team': 'Los Angeles Dodgers',
 'event_time': datetime.datetime(2019, 6, 8, 2, 15),
 'home_odds': 2.85,
 'home_team': 'San Francisco Giants',
 'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
 'league': 'MLB'}]

JSON output with scrapy crawl:

[
{"home_team": "Los Angeles Angels", "away_team": "Seattle Mariners", "event_time": "2019-06-08 02:07:00", "home_odds": 1.58, "away_odds": 2.4, "last_update": "2019-06-06 20:48:16", "league": "MLB"},
{"home_team": "San Diego Padres", "away_team": "Washington Nationals", "event_time": "2019-06-08 02:10:00", "home_odds": 1.87, "away_odds": 1.97, "last_update": "2019-06-06 20:48:16", "league": "MLB"},
{"home_team": "San Francisco Giants", "away_team": "Los Angeles Dodgers", "event_time": "2019-06-08 02:15:00", "home_odds": 2.85, "away_odds": 1.44, "last_update": "2019-06-06 20:48:16", "league": "MLB"}
]

MySpider:

from scrapy.spiders import Spider
from ..items import MatchItem
import json
import datetime
import dateutil.parser

class MySpider(Spider):
    name = 'first_spider'

    start_urls = ["https://websiteXYZ.com"]

    def parse(self, response):
        item = MatchItem()

        timestamp = datetime.datetime.utcnow()

        response_json = json.loads(response.body)

        for event in response_json["el"]:
            for team in event["epl"]:
                if team["so"] == 1: item["home_team"] = team["pn"]
                if team["so"] == 2: item["away_team"] = team["pn"]

            for market in event["ml"]:
                if market["mn"] == "Match result":
                    item["event_time"] = dateutil.parser.parse(market["dd"]).replace(tzinfo=None)
                    for outcome in market["msl"]:
                        if outcome["mst"] == "1": item["home_odds"] = outcome["msp"]
                        if outcome["mst"] == "X": item["draw_odds"] = outcome["msp"]
                        if outcome["mst"] == "2": item["away_odds"] = outcome["msp"]

                if market["mn"] == 'Moneyline':
                    item["event_time"] = dateutil.parser.parse(market["dd"]).replace(tzinfo=None)
                    for outcome in market["msl"]:
                        if outcome["mst"] == "1": item["home_odds"] = outcome["msp"]
                        #if outcome["mst"] == "X": item["draw_odds"] = outcome["msp"]
                        if outcome["mst"] == "2": item["away_odds"] = outcome["msp"]


            item["last_update"] = timestamp
            item["league"] = event["scn"]

            yield item

Edit:

Based on the answer below, I tried the following two scripts:

controller.py

import json
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from betsson_controlled.spiders.betsson import Betsson_Spider
from scrapy.utils.project import get_project_settings


class MyCrawlerRunner(CrawlerRunner):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (Same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted.Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when crawl is done call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items

def return_spider_output(output):
    return json.dumps([dict(item) for item in output])

settings = get_project_settings()
runner = MyCrawlerRunner(settings)
spider = Betsson_Spider()
deferred = runner.crawl(spider)
deferred.addCallback(return_spider_output)


reactor.run()
print(deferred)

When I execute controller.py, I get:

<Deferred at 0x7fb046e652b0 current result: '[{"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}]'>

Solution

RECENT EDITS: After reading CrawlerProcess vs CrawlerRunner I realized that you probably want CrawlerProcess. I had to use the runner since I needed Klein to be able to use the deferred object. Process expects only Scrapy, whereas the runner expects other scripts/programs to interact with it. Hope this helps.

You need to modify CrawlerRunner/Process and use signals and/or callbacks to pass each item into your script from the CrawlerRunner.

How to integrate Flask & Scrapy? If you look at the options in the top answer there, the one with Twisted Klein and Scrapy is an example of what you are looking for, since it does the same thing except that it sends the items to a Klein HTTP server after the crawl. You can set up a similar method with the CrawlerRunner to send each item to your script as it is crawling. NOTE: that particular question sends the results to a Klein web server after the items are collected. Its answer builds an API that collects the results, waits until crawling is done, and dumps them to JSON, but you can apply the same method to your situation. The main thing to look at is how CrawlerRunner was subclassed and extended to add the extra functionality.

What you want is a separate script, which you execute, that imports your Spider and extends CrawlerRunner. When you execute this script it will start your Twisted reactor and launch the crawl using your customized runner.
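For example, a minimal driver could look like the following untested sketch. It reuses MyCrawlerRunner, Betsson_Spider, and return_spider_output from controller.py above, and adds a reactor.stop() callback so the script actually exits once the crawl finishes (note that crawl() is given the spider class, not an instance):

from twisted.internet import reactor
from scrapy.utils.project import get_project_settings

runner = MyCrawlerRunner(get_project_settings())
deferred = runner.crawl(Betsson_Spider)      # crawl() returns a Twisted Deferred
deferred.addCallback(return_spider_output)   # fires with the collected items as a JSON string
deferred.addCallback(print)                  # print the JSON once it is ready
deferred.addBoth(lambda _: reactor.stop())   # stop the reactor so reactor.run() returns
reactor.run()                                # blocks until the crawl is finished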

That said -- this problem could probably also be solved in an item pipeline: create a custom item pipeline that hands each item to your script before returning it (a sketch follows).
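As a rough illustration of that pipeline approach (an untested sketch; myproject and the class name are hypothetical placeholders, not from the question's project):

# myproject/pipelines.py
class ItemCollectorPipeline:
    # collects every scraped item into a class-level list the calling script can read
    items = []

    def process_item(self, item, spider):
        ItemCollectorPipeline.items.append(item)
        return item  # return the item so any later pipelines still run

It would be enabled in settings.py via ITEM_PIPELINES = {'myproject.pipelines.ItemCollectorPipeline': 100}, and after process.start() returns the script could read ItemCollectorPipeline.items. Either way, the signal-based CrawlerProcess version of the same idea is sketched below: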

# main.py

import json
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor, defer # import we missed
from myproject.spiders.mymodule import MySpiderName
from scrapy.utils.project import get_project_settings


class MyCrawlerProcess(CrawlerProcess):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        crawler = self.create_crawler(crawler_or_spidercls)

        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        dfd = self._crawl(crawler, *args, **kwargs)

        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    return json.dumps([dict(item) for item in output])


process = MyCrawlerProcess()
deferred = process.crawl(MySpiderName)
deferred.addCallback(return_spider_output)


process.start()  # script should block here again, but I'm not sure it will work right without reactor.run()
print(deferred)

Again, this code is a guess I haven't tested. I hope it sets you in a better direction.

