Why does scrapy crawler only work once in flask app?

Problem Description
I am currently working on a Flask app. The app takes a url from the user and then crawls that website and returns the links found in that website. This is what my code looks like:
from flask import Flask, render_template, request, redirect, url_for, session, make_response
from flask_executor import Executor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from urllib.parse import urlparse
from uuid import uuid4
import smtplib, urllib3, requests, urllib.parse, datetime, sys, os

app = Flask(__name__)
executor = Executor(app)
http = urllib3.PoolManager()
process = CrawlerProcess()

list = set([])
list_validate = set([])
list_final = set([])

@app.route('/', methods=["POST", "GET"])
def index():
    if request.method == "POST":
        url_input = request.form["usr_input"]

        # Modifying URL
        if 'https://' in url_input and url_input[-1] == '/':
            url = str(url_input)
        elif 'https://' in url_input and url_input[-1] != '/':
            url = str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] != '/':
            url = 'https://' + str(url_input) + '/'
        elif 'https://' not in url_input and url_input[-1] == '/':
            url = 'https://' + str(url_input)

        # Validating URL
        try:
            response = requests.get(url)
            error = http.request("GET", url)
            if error.status == 200:
                parse = urlparse(url).netloc.split('.')
                base_url = parse[-2] + '.' + parse[-1]
                start_url = [str(url)]
                allowed_url = [str(base_url)]

                # Crawling links
                class Crawler(CrawlSpider):
                    name = "crawler"
                    start_urls = start_url
                    allowed_domains = allowed_url
                    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

                    def parse_links(self, response):
                        base_url = url
                        href = response.xpath('//a/@href').getall()
                        list.add(urllib.parse.quote(response.url, safe=':/'))
                        for link in href:
                            if base_url not in link:
                                list.add(urllib.parse.quote(response.urljoin(link), safe=':/'))
                        for link in list:
                            if base_url in link:
                                list_validate.add(link)

                def start():
                    process.crawl(Crawler)
                    process.start()
                    for link in list_validate:
                        error = http.request("GET", link)
                        if error.status == 200:
                            list_final.add(link)
                    original_stdout = sys.stdout
                    with open('templates/file.txt', 'w') as f:
                        sys.stdout = f
                        for link in list_final:
                            print(link)
                    sys.stdout = original_stdout

                unique_id = uuid4().__str__()
                executor.submit_stored(unique_id, start)
                return redirect(url_for('crawling', id=unique_id))
        except Exception:
            # invalid or unreachable URL: show the form again
            return render_template('index.html')
    else:
        return render_template('index.html')

@app.route('/crawling-<string:id>')
def crawling(id):
    if not executor.futures.done(id):
        return render_template('start-crawl.html', refresh=True)
    else:
        executor.futures.pop(id)
        return render_template('finish-crawl.html')
In my start.html, I have this:
{% if refresh %}
    <meta http-equiv="refresh" content="5">
{% endif %}
This code takes a url from a user, validates it, and, if it is a working url, starts crawling and takes the user to the start-crawl.html page. That page refreshes every 5 seconds until the crawling is complete, and once the crawl finishes it renders finish-crawl.html. In finish-crawl.html, the user can download a file that has the output (not included here because it isn't necessary).
Everything works as expected. My problem is that once I crawl a website, it finishes crawling, and I am at finish-crawl.html, I can't crawl another website. If I go back to the home page and enter another url, it validates the url and then goes directly to finish-crawl.html. I think this happens because scrapy can only be run once per process and the reactor isn't restartable, which is what I am effectively trying to do here. So does anyone know what I can do to fix this? Please ignore the complexity of the code and anything that isn't considered "a programming convention".
Recommended Answer

Scrapy recommends using CrawlerRunner instead of CrawlerProcess:
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Spider definition goes here
    name = "myspider"

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)

def finished(e):
    print("finished")

def spider_error(e):
    print("spider error :/")

d.addCallback(finished)
d.addErrback(spider_error)
reactor.run()
More information about the reactor is available here: ReactorBasic
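Note that CrawlerRunner only sidesteps the problem if the reactor is started once and kept running, which does not map cleanly onto a per-request Flask handler. Another common workaround is to launch each crawl in a child process, so every crawl gets its own fresh Twisted reactor. This is a sketch under assumptions not in the answer above (the function names are hypothetical, and it assumes blocking until the crawl finishes is acceptable):

```python
import multiprocessing

def run_crawl_in_subprocess(crawl_fn, *args):
    """Run crawl_fn in a child process and wait for it to finish.

    The Twisted reactor is created and torn down entirely inside the
    child, so the parent (e.g. a Flask worker) can trigger as many
    crawls as it likes without hitting ReactorNotRestartable.
    """
    proc = multiprocessing.Process(target=crawl_fn, args=args)
    proc.start()
    proc.join()
    return proc.exitcode

def crawl(url):
    # In the real app this body would build a CrawlerProcess, register
    # the Crawler class, and call process.start(); the reactor lives
    # and dies with this child process.
    print("crawling", url)

if __name__ == "__main__":
    run_crawl_in_subprocess(crawl, "https://example.com/")
    run_crawl_in_subprocess(crawl, "https://example.org/")  # works again
```

In the Flask app above, the body of start() would become the `crawl` target, and the executor task would call run_crawl_in_subprocess instead of touching the reactor in-process.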