Heroku and Web scraping


Question

I have a Nokogiri web scraper that writes to a database, and I'm trying to publish it to Heroku. I have a Sinatra application as the frontend, which I want to read from that database. I'm new to Heroku and web development, and I don't know the best way to handle something like this.

Do I have to place the web scraper script that uploads to the database under a Sinatra route (like mywebsite.com/scraper) and just make it so obscure that no one visits it? In the end, I'd like the Sinatra part to be a REST API that pulls from the database.

Thanks for any input.

Recommended answer

There are two approaches you can take.

The first is to use one-off dynos by running the scraper from the console with heroku run YOURCMD. Just make sure the scraper doesn't write to disk but uses the database instead: a dyno's filesystem is ephemeral, so files written there are lost when the dyno stops.

More info: https://devcenter.heroku.com/articles/one-off-dynos
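As a rough illustration of the one-off approach, the scraper could be a standalone script that keeps everything in memory and writes rows straight to the database. This is only a sketch under several assumptions that are not in the original answer: the file name scraper.rb, the links table with title and url columns, the 'a' selector, and the use of the Sequel gem (with the pg driver) are all hypothetical choices. DATABASE_URL is the config variable Heroku Postgres sets.

```ruby
# scraper.rb: hypothetical one-off script. Assumed invocation:
#   heroku run ruby scraper.rb https://example.com/

# Turn a parsed page into plain hashes. `doc` is expected to be a
# Nokogiri document; the 'a' selector is only an example.
def extract_rows(doc)
  doc.css('a').map { |link| { title: link.text.strip, url: link['href'] } }
end

# Insert each row through `dataset` (a Sequel dataset in production).
# Nothing is written to the dyno's disk. Returns the number of rows.
def save_rows(rows, dataset)
  rows.each { |row| dataset.insert(row) }
  rows.size
end

# Fetch the page, parse it, and store the results.
def run_scraper(page_url)
  require 'open-uri'
  require 'nokogiri'
  require 'sequel'

  db = Sequel.connect(ENV.fetch('DATABASE_URL'))
  doc = Nokogiri::HTML(URI.open(page_url).read)
  save_rows(extract_rows(doc), db[:links]) # assumes a `links` table exists
end

# Only scrape when a URL is passed on the command line.
run_scraper(ARGV.first) if ARGV.first
```

Because the script is never mounted under a route, nothing about it is reachable from the web; it only runs when you start it with heroku run.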

The second is to separate the scraper from the web process: the web process handles normal UI interaction, and a scraper process runs alongside it that the web process can spawn or talk to. If you take this route, it's up to you how to protect it from the rest of the world (auth, URL obfuscation, etc.).

More info: https://devcenter.heroku.com/articles/background-jobs-queueing
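If the web process does expose a way to trigger the scraper, one possible form of the "auth" mentioned above is a shared secret checked on every request. The helper below is a minimal sketch, not something from the original answer: the SCRAPER_TOKEN config variable, the X-Scraper-Token header, and the /scraper route in the comment are all assumed names.

```ruby
require 'digest'

# Compare two strings without stopping at the first differing byte.
# Hashing first gives both sides the same length.
def secure_equal?(a, b)
  da = Digest::SHA256.digest(a.to_s)
  db = Digest::SHA256.digest(b.to_s)
  diff = 0
  da.bytes.zip(db.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

# True only when a secret is configured and the caller supplied it.
# SCRAPER_TOKEN is a hypothetical config var (heroku config:set SCRAPER_TOKEN=...).
def scraper_request_allowed?(provided_token, expected_token = ENV['SCRAPER_TOKEN'])
  return false if expected_token.nil? || expected_token.empty?

  secure_equal?(provided_token, expected_token)
end

# Possible use inside a Sinatra app (route and header names are assumptions):
#   post '/scraper' do
#     halt 403 unless scraper_request_allowed?(request.env['HTTP_X_SCRAPER_TOKEN'])
#     # hand the work to the scraper process here
#     202
#   end
```

Unlike URL obfuscation alone, this keeps the endpoint closed even if someone discovers the path.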
