Get HTML content from another site

Question

I would like to dynamically retrieve the HTML content of another website. I have the company's permission to do so.

Please don't point me to JSONP: I can't edit Site A, only Site B.

Recommended answer

Because of cross-domain security issues, you won't be able to do this client-side, unless you're content with an iframe.

With PHP, you can use several methods of "scraping" the content. The approach you use depends on whether you need to use cookies in your requests (i.e. the data is behind a login).

Either way, to start things off on the client side you'll issue a standard AJAX request to your own server:

$.ajax({
  type: "POST",
  url: "localProxy.php",
  data: {url: "maybe_send_your_url_here.php?product_id=1"}
}).done(function( html ) {
   // do something with your HTML!
});
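
On the server side, localProxy.php has to turn the posted url value into a request to the remote site. A minimal sketch follows, assuming the client sends only a relative path as in the snippet above. The names REMOTE_BASE and resolveTargetUrl are hypothetical, and the check is only one possible way to keep the proxy from fetching arbitrary hosts.

```php
<?php
// Hypothetical base address of the remote site (Site A); replace with the real one.
const REMOTE_BASE = 'http://thirdpartydomain.internet/';

/**
 * Turn the relative path posted by the client into a full URL on the remote
 * site, or return null when the input is not a plain relative path.
 */
function resolveTargetUrl(string $relative): ?string
{
    // Reject empty input, anything that starts with a scheme ("http:", "file:")
    // and protocol-relative URLs ("//host"), so requests only go to REMOTE_BASE.
    if ($relative === '' || preg_match('#^([a-z][a-z0-9+.-]*:|//)#i', $relative)) {
        return null;
    }
    return REMOTE_BASE . ltrim($relative, '/');
}

// In localProxy.php the value would come from the AJAX request, e.g.:
// $target = resolveTargetUrl($_POST['url'] ?? '');
// if ($target === null) { http_response_code(400); exit; }
```

The resulting $target is what you would then hand to cURL or DOMDocument, as described next.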

If you need cookies set (if the remote site requires login, you need 'em), you're going to use cURL. The full mechanics of logging in with post data and accepting cookies is a little beyond the scope of this answer, but your requests would look something like this:

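// $username and $password are assumed to be set earlier in your script
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://thirdpartydomain.internet/login_url.php');
// disables TLS certificate checks; tolerable for testing only, leave verification on in production
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.jar'); // file where received cookies are saved
// http_build_query() URL-encodes the credentials
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array('email' => $username, 'password' => $password)));
curl_setopt($ch, CURLOPT_POST, 1);
$result = curl_exec($ch);
curl_close($ch);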

At that point, you can check the $result variable and make sure the login worked. If so, you'd then use cURL to issue another request to grab the page content. The second request won't have all the post junk, and you'd use the URL that you're trying to fetch. You'd end up with a large string full of HTML.
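
That second request might look roughly like the sketch below. It assumes the login request saved its cookies to the same file that is passed in here as $cookieFile; the function names buildFetchOptions and fetchWithCookies are made up for illustration.

```php
<?php
/**
 * Build the cURL options for a plain GET that replays the cookies saved
 * during login. $cookieFile is the path given to CURLOPT_COOKIEJAR earlier.
 */
function buildFetchOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies stored by the login request
        CURLOPT_COOKIEJAR      => $cookieFile, // save any cookies the site sets or refreshes
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_TIMEOUT        => 60,
    ];
}

/** Fetch a page as the logged-in user; returns the HTML, or null on failure. */
function fetchWithCookies(string $url, string $cookieFile): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildFetchOptions($url, $cookieFile));
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}
```

Note there is no CURLOPT_POST or CURLOPT_POSTFIELDS this time: the session cookie alone identifies you to the remote site.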

If you only need a portion of that page's content, you can use the method below to load the string into a DomDocument; just use the loadHTML method instead of loadHTMLFile (see below).

Speaking of DomDocument, if you don't need cookies, then you can use DomDocument directly to fetch the page, skipping cURL:

$doc = new DOMDocument('1.0', 'UTF-8');
// real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
// fetch and parse the remote page (use loadHTML() instead if you already have the HTML as a string)
$doc->loadHTMLFile('http://third_party_url_here.php?query=string');

// since we are working with HTML fragments here, remove <!DOCTYPE if there is one
if ($doc->doctype !== null) {
    $doc->removeChild($doc->doctype);
}

// remove <html></html> and any junk, keeping only <body>
$body = $doc->getElementsByTagName('body');
$doc->replaceChild($body->item(0), $doc->documentElement);

// now, you can get any portion of the html (target a div, for example) using familiar DOM methods

// echo the HTML (or desired portion thereof)
die($doc->saveHTML());
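
To target a single element when you already hold the HTML as a string (the cURL result, for instance), something like the following sketch may do. The helper name extractById and the id used in the usage note are hypothetical, and $id is assumed to be a trusted value without double quotes.

```php
<?php
/**
 * Return the outer HTML of the element with the given id, or null if absent.
 * $id is inserted into an XPath expression, so it must not contain double quotes.
 */
function extractById(string $html, string $id): ?string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    libxml_use_internal_errors(true);   // tolerate sloppy real-world markup
    $doc->loadHTML($html);              // parse a string rather than a URL
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(sprintf('//*[@id="%s"]', $id));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // Passing a node to saveHTML() serializes only that element and its children.
    return $doc->saveHTML($nodes->item(0));
}
```

In the proxy you would then echo, for example, extractById($result, 'product') back to the AJAX caller instead of the whole page.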

Documentation

  • HTML iframe on MDN - https://developer.mozilla.org/en/HTML/Element/iframe
  • jQuery.ajax() - http://api.jquery.com/jQuery.ajax/
  • PHP's cURL - http://php.net/manual/en/book.curl.php
  • curl_setopt (information about using cookies) - http://www.php.net/manual/en/function.curl-setopt.php
  • PHP's DomDocument - http://php.net/manual/en/class.domdocument.php
  • DomDocument::loadHTMLFile - http://www.php.net/manual/en/domdocument.loadhtmlfile.php
  • DomDocument::loadHTML - http://www.php.net/manual/en/domdocument.loadhtml.php
