Heroku and Web scraping


Question

I have a Nokogiri web scraper that writes to a database, and I'm trying to deploy it to Heroku. I also have a Sinatra front end that I want to read from that database. I'm new to Heroku and web development, and don't know the best way to handle something like this.

Do I have to put the scraper script that writes to the database under a Sinatra route (like mywebsite.com/scraper) and just make the URL so obscure that no one visits it? In the end, I'd like the Sinatra part to be a REST API that reads from the database.

Thanks for any input.

Solution

There are two approaches you can take.

The first is to use one-off dynos: run the scraper from the console with heroku run YOURCMD. Just make sure the scraper doesn't write to disk and stores its results in the database instead, because a dyno's filesystem is ephemeral.

More information: https://devcenter.heroku.com/articles/one-off-dynos
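
As a rough sketch of what YOURCMD could run, here is a small Ruby script written with that constraint in mind. Everything specific in it is an assumption rather than something from the question: the file name scraper.rb, a links table with title and url columns, the SCRAPE_URL config var, and the use of the Sequel gem (with pg) for database access. DATABASE_URL is the variable Heroku Postgres sets for the app.

```ruby
# scraper.rb (hypothetical file name; table and column names are assumptions)

# Turn a parsed page into rows. `doc` is meant to be a Nokogiri::HTML
# document, but any object that responds to #css would work here.
def extract_links(doc)
  rows = doc.css("a").map do |a|
    { title: a.text.strip, url: a["href"] }
  end
  rows.reject { |row| row[:url].nil? || row[:url].empty? }
end

# Save rows through a dataset-like object such as Sequel's DB[:links].
# Results go to the database only, never to a local file.
def store_links(dataset, rows)
  rows.each { |row| dataset.insert(row) }
  rows.size
end

def main
  # Gems are required here so the helpers above have no dependencies.
  require "open-uri"
  require "nokogiri"
  require "sequel"

  db   = Sequel.connect(ENV.fetch("DATABASE_URL")) # set by Heroku Postgres
  html = URI.open(ENV.fetch("SCRAPE_URL")).read    # SCRAPE_URL: assumed config var
  count = store_links(db[:links], extract_links(Nokogiri::HTML(html)))
  puts "stored #{count} links"
end

# Only scrape when asked to, e.g. `heroku run ruby scraper.rb run`.
main if ARGV.first == "run"
```

With this layout the scraper never needs a public URL, so there is nothing to hide behind an obscure route. The same command could also be run on a schedule, for example with the Heroku Scheduler add-on.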

The second is to separate the scraper from the web process: the web process handles normal UI interaction, and a separate scraper process does the scraping, which the web process can spawn or talk to. If you take this route, it's up to you how to protect it from the rest of the world (auth, URL obfuscation, etc.).

More information: https://devcenter.heroku.com/articles/background-jobs-queueing
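
To make the "auth" option a bit more concrete, here is a minimal sketch of a trigger endpoint for the web process that only accepts requests carrying a shared secret and then hands the work off instead of scraping inline. It is written as a bare Rack-style callable to stay dependency-free; in a Sinatra app the same check would probably live in a route or a before filter. The X-Scraper-Token header, the job hash, and the enqueue callable are all hypothetical names, and enqueue stands in for whatever queueing library carries the job to the scraper process.

```ruby
require "digest"

# Hypothetical Procfile for the two process types (names are illustrative):
#   web: bundle exec ruby app.rb -p $PORT
#   worker: bundle exec ruby scraper_worker.rb

# Compare two strings without stopping at the first differing byte,
# by comparing their SHA-256 digests byte by byte.
def secure_equal?(a, b)
  diff = 0
  Digest::SHA256.digest(a).bytes.zip(Digest::SHA256.digest(b).bytes) do |x, y|
    diff |= x ^ y
  end
  diff.zero?
end

# Build a Rack-style app that enqueues a scrape job when the caller
# presents the shared secret in the (assumed) X-Scraper-Token header.
def build_trigger_app(token:, enqueue:)
  raise ArgumentError, "token must not be empty" if token.to_s.empty?

  lambda do |env|
    supplied = env["HTTP_X_SCRAPER_TOKEN"].to_s
    if secure_equal?(supplied, token)
      enqueue.call({ task: "scrape" }) # hand off to the scraper process
      [202, { "content-type" => "text/plain" }, ["queued\n"]]
    else
      [403, { "content-type" => "text/plain" }, ["forbidden\n"]]
    end
  end
end
```

The token would normally come from a config var rather than the source code, and the endpoint answers with 202 because the scrape itself happens later in the other process.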
