Heroku and Web scraping
I have a Nokogiri web scraper that writes to a database, which I'm trying to deploy to Heroku. I have a Sinatra application frontend that I want to pull from that database. I'm new to Heroku and web development, and don't know the best way to handle something like this.
Do I have to put the scraper script that writes to the database under a Sinatra route (like mywebsite.com/scraper) and just make it so obscure that no one visits it? In the end, I'd like the Sinatra part to be a REST API that pulls from the database.
Thanks for all input
There are two approaches you can take.
The first is to use one-off dynos, running the scraper from the console with heroku run YOURCMD. Just make sure the scraper doesn't write to disk but uses the database, since a dyno's filesystem is ephemeral.
More information: https://devcenter.heroku.com/articles/one-off-dynos
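One way to keep the scraper off the local disk is to take its database connection from the environment instead of a config file. Below is a minimal sketch, assuming the app has a database add-on that exposes its connection string in the DATABASE_URL config var (as Heroku Postgres does). The helper names database_settings and settings_from_env are hypothetical, and the connection step itself is left to whatever database library the scraper already uses.

```ruby
require "uri"

# Split a DATABASE_URL-style connection string into its parts.
# (Hypothetical helper; not part of any library.)
def database_settings(url)
  uri = URI.parse(url)
  {
    adapter: uri.scheme,
    host: uri.host,
    port: uri.port,
    user: uri.user,
    password: uri.password,
    database: uri.path.sub(%r{\A/}, "") # drop the leading slash
  }
end

# Read the connection string from the environment and fail loudly
# if it is missing, rather than falling back to a file on disk.
def settings_from_env(env = ENV)
  url = env.fetch("DATABASE_URL") do
    raise KeyError, "DATABASE_URL is not set; is a database add-on attached?"
  end
  database_settings(url)
end
```

With something like this at the top of the scraper, the one-off run could look like heroku run ruby scraper.rb, where scraper.rb is an assumed file name.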
The second is to separate the scraper from the web process: the web process handles normal UI interaction, and a scraper process runs alongside it that the web process can spawn or talk to. If you take this route, it's up to you how to protect it from the rest of the world (auth, URL obfuscation, etc.).
More information: https://devcenter.heroku.com/articles/background-jobs-queueing
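If the scraper ends up being triggered over HTTP, the "auth" option can be as small as a shared-secret check. The following is only a sketch under assumptions: SCRAPER_TOKEN is a hypothetical config var you would set yourself, and authorized? is a hypothetical helper that a route would call before starting the job. Neither comes from Heroku or Sinatra.

```ruby
require "digest"

# Compare two strings without an early exit on the first mismatch.
# Hashing first gives both sides the same length.
def secure_equal?(left, right)
  left_bytes = Digest::SHA256.digest(left).bytes
  right_bytes = Digest::SHA256.digest(right).bytes
  difference = 0
  left_bytes.zip(right_bytes) { |l, r| difference |= l ^ r }
  difference.zero?
end

# True only when a secret is configured and the caller supplied it.
# SCRAPER_TOKEN is an assumed config var name.
def authorized?(provided_token, env = ENV)
  expected = env["SCRAPER_TOKEN"].to_s
  return false if expected.empty? || provided_token.to_s.empty?

  secure_equal?(provided_token.to_s, expected)
end
```

In a Sinatra route this might be used as halt 401 unless authorized?(request.env["HTTP_X_SCRAPER_TOKEN"]), where the header name is again an assumption. Unlike an obscure URL, the secret can be rotated without changing any routes.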