Using the Readability API to scrape the most relevant image from a page

Question

I am using the Readability API to do this. Their example shows a lead_image_url field, but I could not fetch it.

Reference: https://www.readability.com/developers/api/parser

Is this the correct way to make a direct request?

  1. https://www.readability.com/parser/?token=1b830931777ac7c2ac954e9f0d67df437175e66e&url=http://nextbigwhat.com

It says: {"messages": "The API Key in the form of the 'token' parameter is invalid.", "error": true}

Another attempt:

    <?php
    define('TOKEN', "1b830931777ac7c2ac954e9f0d67df437175e66e");
    define('API_URL', "https://www.readability.com/api/content/v1/parser?url=%s&token=%s");

    function get_image($url) {
        // URL-encode the target so it does not break the API URL
        $encodedUrl = urlencode($url);

        // Build the request URL from the constants defined above
        $requestUrl = sprintf(API_URL, $encodedUrl, TOKEN);

        // Call the API; file_get_contents returns false on failure
        $response = file_get_contents($requestUrl);
        if ($response === false) {
            return false;
        }

        // Decode into an associative array so keys can be read with []
        $json = json_decode($response, true);
        if (!isset($json['lead_image_url'])) {
            return false;
        }

        return $json['lead_image_url'];
    }
    

Error: Warning: file_get_contents(https://www.readability.com/api/content/v1/parser?url=http%3A%2F%2Fthenwat.com%2Fthenwat%2Finvite%2Findex.php&token=1b830931777ac7c2ac954e9f0d67df437175e66e): failed to open stream: HTTP request failed! HTTP/1.1 403 FORBIDDEN in F:\wamp\www\inviteold\test2.php on line 32

One more attempt:

    require 'readability/lib/Readability.inc.php';
    $url = 'http://www.nextbigwhat.com';
    $html = file_get_contents($url);
    
    $Readability     = new Readability($html); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();
    
    $image= $ReadabilityData['lead_image_url'];
    $title= $ReadabilityData['title']; //This works fine.
    $content = $ReadabilityData['word_count'];
    
    echo "$content"; 
    

It says: Notice: Undefined index: lead_image_url in F:\wamp\www\inviteold\test2.php on line 13

Recommended answer

First, to use the REST API they provide, you need to create an account. You can then generate your own token to use in the call. The token shown in their examples will not work because it is intentionally invalid; it is there for illustration only. That explains the invalid-token message and the 403 response in the question.
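
As a minimal sketch of the request itself, the helpers below build the Parser API URL with encoded query parameters and read lead_image_url from a JSON response body. The function names buildParserUrl and extractLeadImage are hypothetical, and YOUR_TOKEN stands in for a token generated in your own account; the endpoint is the one used in the question.

```php
<?php
// Sketch only: helper names are hypothetical, not part of any Readability SDK.

// Build the Parser API request URL with properly encoded query parameters.
function buildParserUrl($token, $pageUrl) {
    $query = http_build_query(array('url' => $pageUrl, 'token' => $token), '', '&');
    return 'https://www.readability.com/api/content/v1/parser?' . $query;
}

// Read lead_image_url from a JSON response body; returns null when it is absent.
function extractLeadImage($responseBody) {
    $data = json_decode($responseBody, true); // true => associative array
    if (!is_array($data) || empty($data['lead_image_url'])) {
        return null;
    }
    return $data['lead_image_url'];
}

// YOUR_TOKEN is a placeholder, not a working token.
echo buildParserUrl('YOUR_TOKEN', 'http://nextbigwhat.com'), "\n";
```

Keeping the URL building and the JSON handling separate from the HTTP call makes both easy to check without a valid token.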

Second, make sure the allow_url_fopen directive in your php.ini file is enabled, since file_get_contents needs it to open remote URLs. Note that allow_url_fopen is a system-level setting, so calling ini_set('allow_url_fopen', true); at the top of a script has no effect. If you cannot change php.ini (for example on shared hosting), fetch the page with the cURL extension instead.
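
A script can check at runtime whether allow_url_fopen is enabled and fall back to cURL when it is not. The following is a rough sketch under that assumption; remoteFopenEnabled and fetchUrl are hypothetical names, and the 10-second timeout is an arbitrary choice.

```php
<?php
// Hypothetical helper: report whether remote URLs can be opened with fopen wrappers.
function remoteFopenEnabled() {
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical helper: fetch a URL with file_get_contents when allowed,
// otherwise with cURL if that extension is loaded. Returns false on failure.
function fetchUrl($url) {
    if (remoteFopenEnabled()) {
        return @file_get_contents($url);
    }
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // arbitrary timeout in seconds
        return curl_exec($ch);
    }
    return false;
}

var_dump(remoteFopenEnabled());
```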

Lastly, to parse the images yourself, you need to collect all image elements from the DOM of the page you fetched. Some pages will have images and some will not; it depends on the page you are pulling from. You will also need to resolve relative paths.

Your code

    require 'readability/lib/Readability.inc.php';
    $url = 'http://www.nextbigwhat.com';
    $html = file_get_contents($url);
    
    $Readability     = new Readability($html); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();
    
    $image= $ReadabilityData['lead_image_url'];
    $title= $ReadabilityData['title']; //This works fine.
    $content = $ReadabilityData['word_count'];
    
    echo "$content"; 
    

After running Readability, you can use the DOMDocument class to retrieve the images from the content you pulled. Instantiate a new DOMDocument and load your HTML into it. Call libxml_use_internal_errors(true) first to suppress the parser warnings that the markup of most websites triggers. We'll put this in a function so it is easier to reuse elsewhere if need be.

    function sampleDomMedia($html) {
        // Suppress parser warnings caused by invalid markup
        libxml_use_internal_errors(true);

        // New document
        $dom = new DOMDocument();
        // Populate document
        $dom->loadHTML($html);
        //[...]
    

You can now retrieve all image elements from the document and read their src attributes, like so:

        //[...]
        // Get image elements
        $nodeList = $dom->getElementsByTagName('img');
    
        // Get length
        $length = $nodeList->length;
    
        // Initialize array
        $images = array();
    
        // Iterate over our nodes
        for($i=0;$i<$length;$i++) {
            // Get the current node
            $node = $nodeList->item($i);
            // Retrieve the src attribute
            $image = $node->getAttribute('src');
    
            // Push image src into $images array
            array_push($images,$image);
        }
    
        return $images;
    }
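
To try the extraction step without fetching a live page, here is a small self-contained variant run against an inline HTML string. The function name listImageSources and the sample markup are made up for illustration; unlike the loop above, it skips img tags that have no src value.

```php
<?php
// Illustrative only: the function name and the HTML string are made up.
function listImageSources($html) {
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    $sources = array();
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {                        // skip <img> tags without a usable src
            $sources[] = $src;
        }
    }
    return $sources;
}

$html = '<html><body><p>Intro</p><img src="/logo.png"><img alt="no source"><img src="photos/cat.jpg"></body></html>';
print_r(listImageSources($html));
```

For this sample the result contains /logo.png and photos/cat.jpg, both still relative.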
    

Now you have an array of images that you can present to the user. Before you do that, there is one more thing to handle: relative paths must be resolved so that every entry is an absolute URL to the image on the other site.

To do this, we need the base URL of the site (scheme, host, and port) and the directory of the page we are working with. PHP's parse_url() function provides both. For simplicity's sake, we can wrap this in a function.

    function getUrls($url) {
        // Parse URL
        $urlArr = parse_url($url);

        // Determine Base URL, with scheme, host, and port
        $base = $urlArr['scheme']."://".$urlArr['host'];
        if(array_key_exists("port",$urlArr) && $urlArr['port'] != 80) {
            $base .= ":".$urlArr['port'];
        }

        // Default to the root path when the URL has none (e.g. http://www.nextbigwhat.com)
        $path = isset($urlArr['path']) ? $urlArr['path'] : "/";

        // Truncate the Path using the position of the last forward slash
        $relative = $base.substr($path, 0, strrpos($path,"/")+1);

        // Return our two URLs
        return array($base, $relative);
    }
    

Add a second parameter to the original sampleDomMedia function so that it can call getUrls to obtain these paths. Then check each src attribute's value to determine what kind of path it is, and resolve it accordingly.

    function sampleDomMedia($html, $url) {
        // Retrieve our URLs
        list($baseUrl, $relativeUrl) = getUrls($url);
        // Scheme of the page, reused for protocol-relative sources
        $scheme = parse_url($url, PHP_URL_SCHEME);

        libxml_use_internal_errors(true);

        $dom = new DOMDocument();
        $dom->loadHTML($html);

        $nodeList = $dom->getElementsByTagName('img');
        $length = $nodeList->length;
        $images = array();

        for($i=0;$i<$length;$i++) {
            $node = $nodeList->item($i);
            $image = $node->getAttribute('src');

            // Resolve relative paths
            if(substr($image,0,2)=="//") { // Missing protocol
                $image = $scheme.":".$image;
            } else if(substr($image,0,1)=="/") { // Path relative to the site root
                $image = $baseUrl.$image;
            } else if(substr($image,0,4)!=="http") { // Path relative to the current directory
                $image = $relativeUrl.$image;
            }

            array_push($images,$image);
        }

        return $images;
    }
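
The three cases can also be checked on their own, without any DOM parsing. The sketch below applies the same rules to single src values; resolveImageUrl is a hypothetical helper and the example.com URLs are placeholders. It does not normalize ../ segments or handle data: URIs.

```php
<?php
// Hypothetical helper: resolve one src value against the URL of the page it came from.
function resolveImageUrl($src, $pageUrl) {
    $parts = parse_url($pageUrl);
    $base = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $base .= ':' . $parts['port'];
    }
    // Directory of the page; a URL without a path is treated as the site root.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    if (substr($src, 0, 2) === '//') {          // protocol-relative: reuse the page's scheme
        return $parts['scheme'] . ':' . $src;
    }
    if (substr($src, 0, 1) === '/') {           // relative to the site root
        return $base . $src;
    }
    if (preg_match('#^https?://#i', $src)) {    // already absolute
        return $src;
    }
    return $base . $dir . $src;                 // relative to the page's directory
}

$page = 'https://example.com/blog/post.html';
foreach (array('//cdn.example.com/x.jpg', '/logo.png', 'pic.gif') as $src) {
    echo resolveImageUrl($src, $page), "\n";
}
```

For the sample page these resolve to https://cdn.example.com/x.jpg, https://example.com/logo.png, and https://example.com/blog/pic.gif.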
    

Last, but certainly not least, we're left with the two functions above and this piece of procedural code:

    require 'readability/lib/Readability.inc.php';
    $url = 'http://www.nextbigwhat.com';
    $html = file_get_contents($url);

    $Readability     = new Readability($html); // default charset is utf-8
    $ReadabilityData = $Readability->getContent();

    // Guard the lookup: this key was undefined in the question's output
    $image = isset($ReadabilityData['lead_image_url']) ? $ReadabilityData['lead_image_url'] : null;
    // Collect every image on the page as an absolute URL
    $images = sampleDomMedia($html, $url);

    $title = $ReadabilityData['title']; //This works fine.
    $content = $ReadabilityData['word_count'];

    echo "$content";
    

Also, if you think the article body itself may contain an image (it usually doesn't), you can pass the content returned by Readability rather than the $html variable, like so:

    $title = $ReadabilityData['title']; //This works fine.
    // Assumes the library returns the extracted article HTML under the 'content' key
    $content = $ReadabilityData['content'];
    $images = sampleDomMedia($content, $url);
    

I hope that helps.
