Looking for a way to scrape HTML with JS


Question

As the title suggests, I'm looking for a (hopefully straightforward) way of scraping all of the HTML from a webpage: perhaps storing it in a string, and then navigating through that string to pull out the desired element.

Specifically, I want to scrape my Twitter page and display my profile picture inside a new div. I know there are several tools for doing just this, but would anyone have some code examples or suggestions for how I might do this myself?

Many thanks.

Update

After a very helpful response from T.J. Crowder I did some more searching online and found this resource.

Recommended answer

In theory, this is easy. You do an ajax call to get the text of the page, use jQuery to turn that text into a disconnected DOM (one that is not attached to your own page), and then use all the usual jQuery tools to find and extract what you need.

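$.ajax({
    url:      "http://example.com/some/path",
    dataType: "html", // treat the response as HTML text
    success:  function(html) {
        // Parse the text into a disconnected DOM. Wrapping the nodes in a
        // <div> lets find() match top-level elements too, which a plain
        // $(html).find(...) would skip.
        var tree = $("<div>").append($.parseHTML(html));
        var imgsrc = tree.find("img.some-class").attr("src");
        if (imgsrc) {
            // ...add the image to your page
        }
    }
});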
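One detail for the "add the image to your page" step: a scraped `src` may be relative, in which case it would resolve against your own page rather than the page it came from. The sketch below shows one way to make it absolute using the standard `URL` constructor. The helper name `resolveImageUrl` and the sample paths are illustrative assumptions, not part of the original answer.

```javascript
// Resolve a scraped src attribute against the URL of the page it came from.
// `resolveImageUrl` is a hypothetical helper name used for illustration.
function resolveImageUrl(src, pageUrl) {
    // The URL constructor leaves absolute URLs unchanged and resolves
    // relative ones against the base given as the second argument.
    return new URL(src, pageUrl).href;
}

// Example values only; the page URL matches the placeholder used above.
var pageUrl = "http://example.com/some/path";
console.log(resolveImageUrl("/img/me.png", pageUrl));
console.log(resolveImageUrl("https://cdn.example.net/a.png", pageUrl));
```

In the browser, the resolved URL could then be used inside the `if (imgsrc)` branch, for example with `$("<div>").append($("<img>").attr("src", resolveImageUrl(imgsrc, pageUrl))).appendTo(document.body)`.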

But (and it's a big "but") it's not likely to work, because of the Same Origin Policy, which prevents cross-origin ajax calls. Certain individual sites may have an open CORS policy, but most won't, and of course supporting CORS on IE8 and IE9 requires an extra jQuery plug-in.
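To make "an open CORS policy" concrete: for a cross-origin call, the browser only hands the response to your script if the other site's response carries an `Access-Control-Allow-Origin` header that is either `*` or exactly your page's origin. The function below is a deliberately simplified model of that check, assuming a simple request without credentials or preflight. The name `corsAllows` and the example origins are made up for illustration; the real check happens inside the browser, not in your code.

```javascript
// Simplified model of the browser's CORS check for a simple,
// non-credentialed cross-origin request. Illustration only.
//   allowOriginHeader: value of the response's Access-Control-Allow-Origin
//                      header, or undefined/null if the header is absent
//   pageOrigin:        origin of the page making the ajax call
function corsAllows(allowOriginHeader, pageOrigin) {
    if (!allowOriginHeader) {
        return false; // no header: the Same Origin Policy blocks the read
    }
    var value = allowOriginHeader.trim();
    // "*" opens the resource to every origin; otherwise an exact match is needed.
    return value === "*" || value === pageOrigin;
}

// Hypothetical origins, used only as examples.
console.log(corsAllows("*", "http://mysite.example"));
console.log(corsAllows("http://other.example", "http://mysite.example"));
console.log(corsAllows(undefined, "http://mysite.example"));
```

Most sites send no such header at all, which corresponds to the last case and is why the ajax call above usually fails across origins.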

So to do this with sites that don't allow your origin via CORS, there must be a server involved. It can be your own server: grab the text of the page you want using server-side code, and then send it to your page via ajax (or just build the bits you want into your page when you first render it). All of the usual server-side stacks (PHP, Node, ASP.Net, JVM, ...) have the ability to grab web pages. Or, in some cases, you may be able to use YQL as a cross-domain proxy, using their server rather than your own.
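As a rough sketch of the "your own server" option, here is what a tiny Node proxy might look like. It assumes Node 18 or later (for the built-in `fetch`). The target URL, the class name, and the names `extractImgSrc` and `createProxy` are placeholders invented for this example. The regex-based extraction only keeps the sketch dependency-free; it is fragile, and a real HTML parser would be a better choice in practice.

```javascript
const http = require("http");

// Placeholder values; substitute the page and class you actually want.
const TARGET_URL = "http://example.com/some/path";
const IMG_CLASS = "some-class";

// Return the src of the first <img> whose class list contains `className`,
// or null if none is found. Regex parsing is an assumption made for brevity.
function extractImgSrc(html, className) {
    const imgTags = html.match(/<img\b[^>]*>/gi) || [];
    for (const tag of imgTags) {
        const cls = /\sclass\s*=\s*["']([^"']*)["']/i.exec(tag);
        const src = /\ssrc\s*=\s*["']([^"']*)["']/i.exec(tag);
        if (cls && src && cls[1].split(/\s+/).includes(className)) {
            return src[1];
        }
    }
    return null;
}

// Build (but do not start) an HTTP server that fetches the target page
// server-side, where the Same Origin Policy does not apply, and replies
// with the image URL as JSON. `fetchImpl` defaults to Node's global fetch.
function createProxy(fetchImpl = fetch) {
    return http.createServer(async (req, res) => {
        try {
            const upstream = await fetchImpl(TARGET_URL);
            const html = await upstream.text();
            const src = extractImgSrc(html, IMG_CLASS);
            res.writeHead(src ? 200 : 404, { "Content-Type": "application/json" });
            res.end(JSON.stringify({ src: src }));
        } catch (err) {
            res.writeHead(502, { "Content-Type": "application/json" });
            res.end(JSON.stringify({ error: "upstream fetch failed" }));
        }
    });
}
```

Calling `createProxy().listen(3000)` would start it on port 3000. If your page is served from the same origin as this proxy, the earlier `$.ajax` call can simply point at the proxy's URL instead of the remote site.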
