How do you scrape AJAX pages?

Question

Please advise how to scrape AJAX pages.

Recommended answer

Overview:

All screen scraping starts with a manual review of the page you want to extract data from. With AJAX pages you usually need to analyze a bit more than just the HTML.

With AJAX, the value you want is not in the initial HTML document you requested. Instead, JavaScript on the page is executed and asks the server for the extra information you want.

You can therefore usually just analyze the JavaScript, see which request it makes, and call that URL directly from the start.
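As a rough illustration of that analysis step, the sketch below scans a script's source text for open("METHOD", "URL", ...) calls, which is how XMLHttpRequest code names the URL it requests. The function name find_xhr_requests and the regular expression are assumptions made for this example. It only catches URLs written as string literals, so requests whose URL is built at runtime still need to be traced by hand.

```python
import re

# Matches calls such as xmlHttp.open("GET", "time.asp", true).
# Assumption: both the method and the URL are plain string literals.
XHR_OPEN = re.compile(
    r"""\.open\(\s*(["'])(?P<method>[A-Za-z]+)\1\s*,\s*(["'])(?P<url>[^"']+)\3"""
)

def find_xhr_requests(script_source):
    """Return (method, url) pairs for literal XMLHttpRequest open() calls."""
    return [
        (match.group("method").upper(), match.group("url"))
        for match in XHR_OPEN.finditer(script_source)
    ]

# Two lines taken from the kind of script a scraped page might contain.
sample = 'xmlHttp.open("GET","time.asp",true);\nxmlHttp.send(null);'
print(find_xhr_requests(sample))
```

Each returned pair gives the HTTP method and the (possibly relative) URL that the page's script would request.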

Example:

Suppose the page you want to scrape contains the following script:

<script type="text/javascript">
function ajaxFunction()
{
var xmlHttp;
try
  {
  // Firefox, Opera 8.0+, Safari
  xmlHttp=new XMLHttpRequest();
  }
catch (e)
  {
  // Internet Explorer
  try
    {
    xmlHttp=new ActiveXObject("Msxml2.XMLHTTP");
    }
  catch (e)
    {
    try
      {
      xmlHttp=new ActiveXObject("Microsoft.XMLHTTP");
      }
    catch (e)
      {
      alert("Your browser does not support AJAX!");
      return false;
      }
    }
  }
  xmlHttp.onreadystatechange=function()
    {
    if(xmlHttp.readyState==4)
      {
      document.myForm.time.value=xmlHttp.responseText;
      }
    }
  xmlHttp.open("GET","time.asp",true);
  xmlHttp.send(null);
  }
</script>

Then all you need to do is make an HTTP request to time.asp on the same server instead. (The example is from w3schools.)
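A minimal sketch of that direct request, using only Python's standard library. The page address http://example.com/ajax/page.html is a placeholder assumed for illustration, and build_request and fetch_text are hypothetical helper names. The relative URL time.asp is resolved against the page's address, as a browser would do.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen

def build_request(page_url, endpoint):
    """Resolve the endpoint against the page URL and prepare a GET request."""
    return Request(urljoin(page_url, endpoint), method="GET")

def fetch_text(request, timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# Placeholder page address (an assumption, not a real endpoint).
request = build_request("http://example.com/ajax/page.html", "time.asp")
print(request.full_url)
# fetch_text(request) would return the same text the page's script
# reads from xmlHttp.responseText.
```

Building the request separately from sending it makes it easy to check the resolved URL before any network traffic happens.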

Advanced scraping with C++:

For complex cases, if you're using C++, you could also consider using SpiderMonkey, the Firefox JavaScript engine, to execute the JavaScript on a page.

Advanced scraping with Java:

For complex cases, if you're using Java, you could also consider using Rhino, Mozilla's JavaScript engine written in Java.

Advanced scraping with .NET:

For complex cases, if you're using .NET, you could also consider using the Microsoft.Vsa assembly, which has more recently been replaced by ICodeCompiler/CodeDOM.
