Efficient practice to scrape a page with client-side output?


Question

I want a script that scrapes a certain web page every hour and looks for a certain string inside that page.

However, when I open that page and use `view-source:`, I cannot see that string in the source. I was told this is because the string I'm looking for comes from an element that is rendered on the client side (JavaScript), so I can see it only when I manually inspect that element, for example with the Chrome console.

Which practice / programming language / environment would be the most efficient way to achieve what I want, considering that I want to run the script from my web host's server, which has 2.25GB of RAM?

Someone suggested that I use PyQt4, but my web host warned me that this would eat up my RAM and hurt server performance. I should note that the script is supposed to be very simple and scrape only a single page, once an hour.

Recommended answer

It seems the problem could be solved with PhantomJS, since it simulates what a real browser does and can therefore extract information produced by client-side code.

For PhantomJS with JavaScript, you may check testing-javascript-with-phantomjs

For how to use PhantomJS with Python, please take a look at this
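As a rough illustration of the PhantomJS-from-Python idea, here is a minimal sketch. It assumes a `phantomjs` binary is available on the server's PATH; the function names `render_with_phantomjs` and `page_contains`, the 2-second wait, and the 60-second timeout are all made up for this example and are not from the answer. The Python side only starts PhantomJS as a short-lived subprocess, reads the rendered HTML from its output, and does a plain substring check.

```python
import subprocess
import tempfile
from pathlib import Path

# PhantomJS script: open the URL given as the first argument, give
# client-side JavaScript a moment to run, then print the rendered DOM.
PHANTOM_SCRIPT = """
var system = require('system');
var page = require('webpage').create();
page.open(system.args[1], function (status) {
    if (status !== 'success') {
        phantom.exit(1);
    } else {
        // Assumed delay (2 s) so client-side rendering can finish.
        window.setTimeout(function () {
            console.log(page.content);
            phantom.exit(0);
        }, 2000);
    }
});
"""


def render_with_phantomjs(url, timeout=60):
    """Return the page's HTML after client-side rendering.

    Assumes a `phantomjs` executable is on PATH (an assumption of this sketch).
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "render.js"
        script.write_text(PHANTOM_SCRIPT)
        result = subprocess.run(
            ["phantomjs", str(script), url],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,  # raise if PhantomJS exits with a non-zero status
        )
    return result.stdout


def page_contains(url, needle, render=render_with_phantomjs):
    """Check whether `needle` appears in the rendered page.

    `render` is injectable so the check can be tried without PhantomJS.
    """
    return needle in render(url)
```

Since the process exits after each check, memory is only used while the page is being rendered. For the hourly run, one option would be a cron entry such as `0 * * * *` that invokes a small script calling `page_contains` with the real URL and string.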

Hope it helps.
