Can Javascript read the source of any web page?


Question


I am working on screen scraping and want to retrieve the source code of a particular page.

How can I achieve this with JavaScript? Please help me.

Solution

A simple way to start is to try jQuery:

$("#links").load("/Main_Page #jq-p-Getting-Started li");

More in the jQuery docs.
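
For a page on the same origin (or one whose server sends CORS headers), the raw source can also be read without jQuery. The following is a minimal sketch using the standard `fetch` API. `fetchPageSource` is a hypothetical helper name, not something from the answer above, and the `fetchImpl` parameter is there only so the function can be exercised without a network.

```javascript
// Hypothetical helper: resolves to the HTML source of `url` as a string.
// Browsers block cross-origin reads unless the target server allows them (CORS),
// so this does not work for arbitrary third-party pages.
async function fetchPageSource(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return response.text();
}
```

Usage would look something like `fetchPageSource("/Main_Page").then(function (html) { console.log(html.length); });`, where `/Main_Page` is the same relative path used in the jQuery call above.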

Another way to do screen scraping, in a much more structured way, is to use YQL (Yahoo Query Language). It returns the scraped data structured as JSON or XML.
For example, let's scrape stackoverflow.com:

select * from html where url="http://stackoverflow.com"

will give you a JSON response (I chose that output format) like this:

 "results": {
   "body": {
    "noscript": [
     {
      "div": {
       "id": "noscript-padding"
      }
     },
     {
      "div": {
       "id": "noscript-warning",
       "p": "Stack Overflow works best with JavaScript enabled"
      }
     }
    ],
    "div": [
     {
      "id": "notify-container"
     },
     {
      "div": [
       {
        "id": "header",
        "div": [
         {
          "id": "hlogo",
          "a": {
           "href": "/",
           "img": {
            "alt": "logo homepage",
            "height": "70",
            "src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
            "width": "250"
           }
……..

The beauty of this is that you can use projections and where clauses, so you get the scraped data in structured form and only the data you need (which ultimately means much less bandwidth over the wire).
For example,

select * from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

will get you

 "results": {
   "a": [
    {
     "href": "/questions/414690/iphone-simulator-port-for-windows-closed",
     "title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
     "content": "iphone\n                simulator port for windows [closed]"
    },
    {
     "href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
     "title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
     "content": "How\n                to redirect the web page in flex application ?"
    },
…..
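
The queries in this answer all follow the same pattern, so they can be assembled from their parts. This is only a sketch: `buildYqlQuery` is a hypothetical helper, and it assumes the page URL and XPath contain no quote characters, since it does no escaping.

```javascript
// Hypothetical helper: composes a YQL statement against the `html` table.
// `fields` is the projection ("*" or a column such as "title");
// `xpath` is optional and narrows the result to the matching nodes.
// Assumption: neither `pageUrl` nor `xpath` contains quotes (no escaping is done).
function buildYqlQuery(fields, pageUrl, xpath) {
  let query = "select " + fields + ' from html where url="' + pageUrl + '"';
  if (xpath) {
    query += " and xpath='" + xpath + "'";
  }
  return query;
}

console.log(buildYqlQuery("*", "http://stackoverflow.com", "//div/h3/a"));
// prints: select * from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
```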

Now, to get only the question titles, we run:

select title from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

Note the title in the projection:

 "results": {
   "a": [
    {
     "title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
    },
    {
     "title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
    },
    {
     "title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
    },
    {
     "title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
    },
    {
……
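
Once a response shaped like the sample above has been parsed, the titles can be collected into a plain array. A minimal sketch, with `extractTitles` as a hypothetical helper; it assumes `a` may arrive as a single object rather than an array when only one node matches, and handles both cases.

```javascript
// Hypothetical helper: pulls the `title` strings out of a "results" object
// shaped like the sample above.
// Assumption: `a` may be a single object instead of an array when only one
// node matched, so both shapes are normalized to an array first.
function extractTitles(results) {
  if (!results || results.a == null) {
    return [];
  }
  const links = Array.isArray(results.a) ? results.a : [results.a];
  return links
    .map(function (link) { return link.title; })
    .filter(function (title) { return typeof title === "string"; })
    .map(function (title) { return title.trim(); });
}
```

The titles in the sample carry trailing whitespace before the closing quote, which is why each one is trimmed.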

Once you write your query, it generates a URL for you:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
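
The same URL can be built by hand by percent-encoding the query. This is a sketch rather than an official client: `buildYqlUrl` is a hypothetical helper, and the endpoint and the `q`, `format` and `callback` parameter names are simply copied from the URL above. As far as I know, Yahoo has since retired the public YQL endpoint, so treat this as an illustration of the URL format rather than a request that will still succeed.

```javascript
// Hypothetical helper: builds a YQL request URL from a query string.
// The endpoint and parameter names (q, format, callback) are taken from the
// generated URL shown above; nothing here contacts the network.
function buildYqlUrl(query, format, callback) {
  let url = "http://query.yahooapis.com/v1/public/yql" +
    "?q=" + encodeURIComponent(query) +
    "&format=" + encodeURIComponent(format);
  if (callback) {
    url += "&callback=" + encodeURIComponent(callback);
  }
  return url;
}

const exampleQuery =
  "select title from html where url=\"http://stackoverflow.com\" and xpath='//div/h3/a'";
console.log(buildYqlUrl(exampleQuery, "json", "cbfunc"));
```

The result differs from the generated URL only in whitespace: the generated one also encodes the line break and indentation that were typed into the query.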

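So ultimately you end up doing something like this. Note that $.getJSON is asynchronous, so the data arrives in a callback rather than as a return value, and that callback=cbfunc in the URL should be changed to callback=? so that jQuery handles the JSONP callback itself:

$.getJSON(theAboveUrl, function (titleList) { console.log(titleList); });

and play with it.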

Beautiful, isn’t it?
