Confused about running Scrapy from within a Python script


Problem description

Following the documentation, I can run Scrapy from a Python script, but I can't get the crawl results.

Here is my spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from items import DmozItem

class DmozSpider(BaseSpider):
    name = "douban"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/group/xxx/discussion"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each matching <a> holds one discussion title and its link
        rows = hxs.select("//table[@class='olt']/tr/td[@class='title']/a")
        items = []
        for row in rows:
            item = DmozItem()
            item["title"] = row.select('text()').extract()[0]
            item["link"] = row.select('@href').extract()[0]
            items.append(item)

        return items
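
The DmozItem imported from items isn't shown in the question; a definition along these lines is assumed, with the two fields populated in parse() above:

from scrapy.item import Item, Field

class DmozItem(Item):
    # Fields filled in by DmozSpider.parse()
    title = Field()
    link = Field()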

Note the spider's last line (return items): I try to use the returned parse result. If I run:

 scrapy crawl douban

the terminal prints the returned items.

But I can't get the returned result from the Python script. Here is my Python script:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from spiders.dmoz_spider import DmozSpider
from scrapy.xlib.pydispatch import dispatcher

def stop_reactor():
    reactor.stop()

# Stop the reactor once the spider finishes
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = DmozSpider(domain='www.douban.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg("------------>Running reactor")
result = reactor.run()
print result
log.msg("------------>Running stopped")

I tried to get the result from reactor.run(), but it returns nothing.

How can I get the result?

Recommended answer

The terminal prints the result because the default log level is set to DEBUG.

When you run your spider from a script and call log.start(), the default log level is set to INFO.

Just replace:

log.start()

with:

log.start(loglevel=log.DEBUG)

Update:

To get the result as a string, you can log everything to a file and then read it back, e.g.:

# Send all output, including the scraped item lines, to results.log
log.start(logfile="results.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)

reactor.run()

with open("results.log", "r") as f:
    result = f.read()
print result
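
If parsing the log file feels indirect, another option with the same old-style Crawler API is to collect the items in memory by connecting a handler to the item signal, just as the script above connects stop_reactor to spider_closed. This is a minimal sketch, not a tested implementation; note the signal was named item_passed in Scrapy releases of that era (newer versions call it item_scraped):

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from spiders.dmoz_spider import DmozSpider

collected = []

def collect_item(item):
    # Called once for every item the spider returns
    collected.append(item)

def stop_reactor():
    reactor.stop()

# item_passed is the era-appropriate signal name; newer Scrapy uses item_scraped
dispatcher.connect(collect_item, signal=signals.item_passed)
dispatcher.connect(stop_reactor, signal=signals.spider_closed)

spider = DmozSpider(domain='www.douban.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
reactor.run()  # blocks until spider_closed fires stop_reactor()

for item in collected:
    print item["title"], item["link"]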

Hope that helps.

