Why does the Scrapy crawler only work once in a Flask app?


Problem description


I am currently working on a Flask app. The app takes a url from the user and then crawls that website and returns the links found in that website. This is what my code looks like:

from flask import Flask, render_template, request, redirect, url_for, session, make_response
from flask_executor import Executor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from urllib.parse import urlparse
from uuid import uuid4
import smtplib, urllib3, requests, urllib.parse, datetime, sys, os

app = Flask(__name__)
executor = Executor(app)

http = urllib3.PoolManager()
process = CrawlerProcess()

list = set([])
list_validate = set([])
list_final = set([])

@app.route('/', methods=["POST", "GET"])
def index():
    if request.method == "POST":
        url_input = request.form["usr_input"]

        # Modifying URL
        if 'https://' in url_input and url_input[-1] == '/':
            url = str(url_input)
        elif 'https://' in url_input and url_input[-1] != '/':
            url = str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] != '/':
            url = 'https://' + str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] == '/':
            url = 'https://' + str(url_input)

        # Validating URL
        try:
            response = requests.get(url)
            error = http.request("GET", url)
            if error.status == 200:
                parse = urlparse(url).netloc.split('.')
                base_url = parse[-2] + '.' + parse[-1]
                start_url = [str(url)]
                allowed_url = [str(base_url)]

                # Crawling links
                class Crawler(CrawlSpider):
                    name = "crawler"
                    start_urls = start_url
                    allowed_domains = allowed_url
                    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

                    def parse_links(self, response):
                        base_url = url
                        href = response.xpath('//a/@href').getall()
                        list.add(urllib.parse.quote(response.url, safe=':/'))
                        for link in href:
                            if base_url not in link:
                                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
                        for link in list:
                            if base_url in link:
                                list_validate.add(link)

                def start():
                    process.crawl(Crawler)
                    process.start()

                    for link in list_validate:
                        error = http.request("GET", link)
                        if error.status == 200:
                            list_final.add(link)

                    original_stdout = sys.stdout
                    with open('templates/file.txt', 'w') as f:
                        sys.stdout = f
                        for link in list_final:
                            print(link)

                unique_id = uuid4().__str__()
                executor.submit_stored(unique_id, start)
                return redirect(url_for('crawling', id=unique_id))
        except Exception:
            pass  # handling of invalid/unreachable URLs not shown
    else:
        return render_template('index.html')

@app.route('/crawling-<string:id>')
def crawling(id):
    if not executor.futures.done(id):
        return render_template('start-crawl.html', refresh=True)
    else:
        executor.futures.pop(id)
        return render_template('finish-crawl.html')

In my start.html I have this:

{% if refresh %}
    <meta http-equiv="refresh" content="5">
{% endif %}


This code takes a URL from the user, validates it, and if it is a working URL, starts crawling and takes the user to the start-crawl.html page. That page refreshes every 5 seconds until the crawl is complete, at which point finish-crawl.html is rendered. In finish-crawl.html the user can download a file with the output (not included here because it isn't necessary).


Everything works as expected. My problem is that once I crawl a website, it finishes crawling, and I am at finish-crawl.html, I can't crawl another website. If I go back to the home page and enter another URL, it validates the URL and then goes directly to finish-crawl.html. I think this happens because Scrapy can only be run once and the Twisted reactor isn't restartable, which is effectively what I am trying to do here. Does anyone know what I can do to fix this? Please ignore the complexity of the code and anything that isn't considered "a programming convention".
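For context, the same limitation can be reproduced outside Flask: CrawlerProcess.start() runs the Twisted reactor, and the reactor cannot be started a second time in the same process. A minimal sketch (the spider name and URL below are placeholders, not part of the original app):

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import Spider

class DemoSpider(Spider):
    name = "demo"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        pass

process = CrawlerProcess()
process.crawl(DemoSpider)
process.start()   # first call: starts the Twisted reactor and blocks until the crawl ends

process.crawl(DemoSpider)
process.start()   # second call: raises twisted.internet.error.ReactorNotRestartable

This is the same failure mode hit when a second URL is submitted to the Flask app above.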

Recommended answer

Scrapy recommends using CrawlerRunner instead of CrawlerProcess.

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)

def finished(e):
    print("finished")
    reactor.stop()   # stop the reactor so the script can exit

def spider_error(e):
    print("spider error :/")
    reactor.stop()

d.addCallback(finished)
d.addErrback(spider_error)
reactor.run()   # blocks here until the crawl finishes and the reactor is stopped


More information about the reactor is available here: ReactorBasic
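For reference only, here is one rough sketch of how CrawlerRunner might be wired into a Flask app like the one above, so that the reactor is started exactly once and each request merely schedules a new crawl on it. This is an assumption-laden illustration rather than the answer's code: the helper name schedule_crawl is made up, and it assumes that keeping the Twisted reactor alive in a background thread for the life of the process is acceptable.

from threading import Thread

from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

runner = CrawlerRunner()

# Start the reactor once, in a daemon thread, at application start-up.
# installSignalHandlers=False is needed because this is not the main thread.
Thread(target=reactor.run, kwargs={"installSignalHandlers": False}, daemon=True).start()

def schedule_crawl(spider_cls):
    # runner.crawl() must be called from the reactor thread, so hand it over
    # with callFromThread; the Deferred fires when the crawl finishes.
    def _start():
        d = runner.crawl(spider_cls)
        d.addBoth(lambda _: print("crawl finished"))
    reactor.callFromThread(_start)

A Flask view could then call schedule_crawl(Crawler) where the original code called process.crawl(Crawler) and process.start(), and signal completion from the Deferred callback (for example by setting a flag that the /crawling-<id> route polls).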
