How to Enhance This? Get a Part of a Web Page in Another Domain


Problem description

    I have made this:

    <html>
        <head>
            <script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
            <script>
                $(document).ready(
                    function()
                    {   
                        $("body").html($("#HomePageTabs_cont_3").html());
                    }
                );
            </script>
        </head>
        <body>
        <?php
            echo file_get_contents("http://www.bankasya.com.tr/index.jsp");
        ?>
    
        </body>
    </html>
    

    When I check my page with Firebug, it reports countless "missing file" errors (images, CSS files, JS files, etc.). I want just one part of the page, not all of it. This code does what I want, but I am wondering if there is a better way.

    EDIT:

    The page does what I need, but I do not need all of its contents, so an iframe is useless to me. I just want the raw data of the div #HomePageTabs_cont_3.

    Solution

    Your best bet is PHP server-side parsing. I have written a small snippet to show you how to do this using DOMDocument (and possibly tidy, if your server has it, to barf out all the malformed XHTML foos).

    Caveat: outputs UTF-8. You can change this in the constructor of DOMDocument.

    Caveat 2: WILL barf out if its input is neither utf-8 nor iso-8859-9. The current page's charset is iso-8859-9 and I see no reason why they would change this.
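
The charset caveat can also be handled up front by normalising the fetched bytes to UTF-8 before any parsing. The following is a minimal sketch under that assumption; `to_utf8()` is a hypothetical helper name that does not appear in the answer, and it relies on the iconv extension being available.

```php
<?php
// Hypothetical helper (not part of the original answer): normalise fetched
// markup to UTF-8 before it is handed to a parser.
function to_utf8(string $data, string $fallbackCharset = 'ISO-8859-9'): string
{
    // An empty pattern with the /u modifier matches only if $data is valid UTF-8.
    if (preg_match('//u', $data) === 1) {
        return $data;
    }
    // Otherwise assume the fallback charset (the page's declared one).
    $converted = iconv($fallbackCharset, 'UTF-8', $data);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert input from $fallbackCharset");
    }
    return $converted;
}

// "\xFD" is the dotless i in ISO-8859-9 and is not valid UTF-8 on its own.
echo to_utf8("kap\xFD"), "\n";
```

With the input normalised this way, a single parse attempt with a UTF-8 declaration would probably be enough, although the try/catch fallback in the answer remains a reasonable safety net.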

    header("content-type: text/html; charset=utf-8");
    $data = file_get_contents("http://www.bankasya.com.tr/index.jsp");
    // Clean it up
    if (class_exists("tidy")) {
       $dataTidy = new tidy();
       $dataTidy->parseString($data,
                                     array(
                                           "input-encoding" => "iso-8859-9",
                                           "output-encoding" => "iso-8859-9",
                                           "clean" => 1,
                                           "input-xml" => true,
                                           "output-xml" => true,
                                           "wrap" => 0,
                                           "anchor-as-name" => false
                                     )
                              );
       $dataTidy->cleanRepair();
       $data = (string)$dataTidy;
    }
    else {
        // Strip every <script>...</script> block; the closing tag is searched
        // for after the opening one so a stray </script> cannot cause a loop
        while (($start = stripos($data, '<script')) !== false
            && ($stop = stripos($data, '</script>', $start)) !== false) {
            $data = substr($data, 0, $start).substr($data, $stop + strlen('</script>'));
        }
        // nbsp breaks it?
        $data = str_replace("&nbsp;"," ",$data);
        // Fixes for any element that requires a self-closing tag
        if (preg_match_all("/<(link|img)([^>]+)>/is",$data,$mt,PREG_SET_ORDER)) {
                foreach ($mt as $v) {
                        if (substr($v[2],-1) != "/") {
                                $data = str_replace($v[0],"<".$v[1].$v[2]."/>",$data);
                        }
                }
        }
        // Barf out the inline JS
        $data = preg_replace("/javascript:[^;]+/is","#",$data);
        // Barf out the noscripts
        $data = preg_replace("#<noscript>(.+?)</noscript>#is","",$data);
        // Muppets. Malformed comment = one more regexp when they could just learn to write proper HTML...
        $data = preg_replace("#<!--(.*?)--!?>#is","",$data);
    }
    $DOM = new \DOMDocument("1.0","utf-8");
    $DOM->recover = true;
    // Turn PHP warnings raised during parsing into exceptions so a failed load can be retried
    function error_callback_xmlfunction($errno, $errstr) { throw new Exception($errstr); }
    $old = set_error_handler("error_callback_xmlfunction");
    // Throw out all the XML namespaces (if any)
    $data = preg_replace("#xmlns=[\"\']?([^\"\']+)[\"\']?#is","",(string)$data);
    try {
          // First attempt: treat the input as UTF-8
          $DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="utf-8"?>' : "").$data);
    } catch (Exception $e) {
          // Fallback: the charset the page currently declares
          $DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="iso-8859-9"?>' : "").$data);
    }
    restore_error_handler();
    error_reporting(E_ALL);
    $DOM->substituteEntities = true;
    $xpath = new \DOMXPath($DOM);
    echo $DOM->saveXML($xpath->query("//div[@id=\"HomePageTabs_cont_3\"]")->item(0));
    

    In order of appearance:

    • Fetch the data
    • If we have tidy, sanitize HTML with it
    • Create a new DOMDocument and load our document ((string)$dataTidy is a short-hand tidy getter)
    • Create an XPath request path
    • Use XPath to request all divs with id set as what we want, get the first item of the collection (->item(0), which will be a DOMElement) and request for the DOM to output its XML content (including the tag itself)

    Hope it is what you're looking for... Though you might want to wrap it in a function.
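
One possible way to wrap the steps above in a function is sketched below. Note the assumptions: `extract_by_id()` is a hypothetical name, and the sketch swaps in `DOMDocument::loadHTML()` with libxml's internal error handling in place of the tidy/regex clean-up and `loadXML()` used in the answer, so it is a variation on the approach rather than the same code. It also assumes the markup has already been converted to UTF-8.

```php
<?php
// Sketch only: extract_by_id() is a hypothetical name, and loadHTML() is
// swapped in for the tidy/regex + loadXML() pipeline used in the answer.
function extract_by_id(string $html, string $id): ?string
{
    if (strpos($id, '"') !== false) {
        throw new InvalidArgumentException('The id must not contain a double quote');
    }

    // Collect markup warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument('1.0', 'utf-8');
    // Prefixing an XML declaration is a widely used workaround to make
    // loadHTML() read the input as UTF-8.
    $loaded = $dom->loadHTML('<?xml encoding="utf-8"?>' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $node = $xpath->query(sprintf('//*[@id="%s"]', $id))->item(0);

    // Return the element including its own tag, or null when nothing matched.
    return $node === null ? null : $dom->saveHTML($node);
}

// Made-up sample markup standing in for the fetched page.
$page = '<div id="other">skip</div><div id="HomePageTabs_cont_3"><b>USD</b></div>';
echo extract_by_id($page, 'HomePageTabs_cont_3'), "\n";
```

In real use, `$page` would be the string returned by `file_get_contents()` for the remote URL. Because the HTML parser is forgiving about unclosed tags, the regex fixes may not be needed with this variant, but that has not been verified against the actual site.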

    Edit

    Forgot to mention: http://rescrape.it/rs.php for the actual script output!

    Edit 2

    Correction, that site is not W3C-valid, and therefore, you'll either need to tidy it up or apply a set of regular expressions to the input before processing. I'm going to see if I can formulate a set to barf out the inconsistencies.

    Edit 3

    Added a fix for all those of us who do not have tidy.

    Edit 4

    Couldn't resist. If you'd actually like the values rather than the table, use this instead of the echo:

     $d = new stdClass();
     $rows = $xpath->query("//div[@id=\"HomePageTabs_cont_3\"]//tr");
     $rc = $rows->length;
     for ($i = 1; $i < $rc-1; $i++) {
         $cols = $xpath->query($rows->item($i)->getNodePath()."/td");
         $d->{$cols->item(0)->textContent} = array(
            ((float)$cols->item(1)->textContent),
            ((float)$cols->item(2)->textContent)
         );
     }
    

    I don't know about you, but for me, data works better than malformed tables.
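
The same extraction can be written with a context node instead of concatenating `getNodePath()` into a new query string. This is only a sketch: `rows_to_array()` is a hypothetical name, the sample table and its numbers are made up for illustration, and it returns a plain array rather than the `stdClass` used above. Instead of skipping the first and last rows by index, it skips any row with fewer than three `td` cells.

```php
<?php
// Sketch: rows_to_array() and the sample table below are illustrative only.
function rows_to_array(DOMXPath $xpath, string $rowQuery): array
{
    $result = [];
    foreach ($xpath->query($rowQuery) as $row) {
        // The second argument makes the query relative to the current <tr>.
        $cols = $xpath->query('td', $row);
        if ($cols->length < 3) {
            continue; // header rows (<th>) and spacer rows
        }
        $result[trim($cols->item(0)->textContent)] = [
            (float) $cols->item(1)->textContent,
            (float) $cols->item(2)->textContent,
        ];
    }
    return $result;
}

// Made-up sample data standing in for the extracted div.
$dom = new DOMDocument();
$dom->loadXML('<div id="HomePageTabs_cont_3"><table>'
    . '<tr><th>Name</th><th>Buy</th><th>Sell</th></tr>'
    . '<tr><td>USD</td><td>1.80</td><td>1.82</td></tr>'
    . '</table></div>');
$rates = rows_to_array(new DOMXPath($dom), '//div[@id="HomePageTabs_cont_3"]//tr');
var_dump($rates);
```

As in the original loop, the `(float)` cast stops at the first character that is not part of a number, so values formatted with a decimal comma would need to be normalised first.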

    (Welp, that one took a while to write)
