Get all images from a board from a Pinterest web address


Problem description

This question sounds easy, but it is not as simple as it sounds.

Brief summary of what's wrong

As an example, take this board: http://pinterest.com/dodo/web-designui-and-mobile/

Examining the HTML for the board itself (inside the div with the class GridItems) at the top of the page yields:

<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
    <!-- First div with a displayed board image -->
    <div class="item" style="top: 0px; left: 0px; visibility: visible;">..</div>
    ...
    <!-- Last div with a displayed board image -->
    <div class="item" style="top: 3343px; left: 1000px; visibility: visible;">..</div>
</div>

Yet at the bottom of the page, after activating the infinite scroll a couple of times, we get this as the HTML:

<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
    <!-- First div with a displayed board image -->
    <div class="item" style="top: 12431px; left: 750px; visibility: visible;">..</div>
    ...
    <!-- Last div with a displayed board image -->
    <div class="item" style="top: 19944px; left: 750px; visibility: visible;">..</div>
</div>

As you can see, some of the image containers higher up the page have disappeared, and not all of the image containers are present when the page first loads.


What I want to do

I want to create a C# script (or one in any server-side language, at this point) that downloads the page's full HTML, i.e. the markup for every image on the board, so that the images can then be downloaded from their URLs. Downloading the webpage and using an appropriate XPath is easy; the real challenge is getting the full HTML for every image.

Is there a way I can emulate scrolling to the bottom of the page, or is there an even easier way to retrieve every image? I imagine that Pinterest uses AJAX to change the HTML. Is there a way I can programmatically trigger the events to receive all of the HTML? Thank you in advance for suggestions and solutions, and kudos for even reading this very long question if you do not have any!

Pseudo code

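using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string pinterestUrl = "http://www.pinterest.com/...";
        string xPath = ".../img";

        // HtmlWeb fetches a URL; HtmlDocument.Load only reads local files or streams.
        HtmlDocument doc = new HtmlWeb().Load(pinterestUrl);

        // Currently only finds the first 25 images.
        var imageLinks = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes(xPath);
        if (nodes != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode link in nodes)
            {
                imageLinks.Add(link.GetAttributeValue("src", ""));
            }
        }
        // Use image links
    }
}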

Solution

Okay, so I think this may be (with a few alterations) what you need.

Caveats:

  1. This is PHP, not C# (but you said you were interested in any server-side language).
  2. This code hooks into (unofficial) Pinterest search endpoints. You'll need to change $data and $search_res to reflect the appropriate endpoints (e.g. BoardFeedResource) for your task. Note: at least for search, Pinterest currently uses two endpoints, one for the initial page load and another for the infinite scroll actions. Each has its own expected parameter structure.
  3. Pinterest has no official public API, so expect this to break without warning whenever they change anything.
  4. You may find pinterestapi.co.uk easier to implement and acceptable for what you're doing.
  5. There is some demo/debug code beneath the class that should be removed once you're getting the data you want, and a default page fetch limit that you may want to change.
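
Caveat 2 comes down to swapping the resource name and the JSON "data" payload. A minimal sketch of that, assuming the URL shape used in the answer's code still applies: build_resource_url is a hypothetical helper (it is not part of the class), and it only sends the "options" and "context" keys, so a "module" key would need to be added the same way where an endpoint expects one.

```php
<?php
// Hypothetical helper (not part of the answer's class): builds a request URL
// of the same shape as the answer's code, but uses json_encode() instead of a
// hand-written JSON string, so quotes or spaces in $options cannot break it.
function build_resource_url($resource, $source_url, array $options, $timestamp_ms) {
  $data = json_encode(array(
    'options' => $options,
    'context' => array('app_version' => '2f83a7e'), // value copied from the answer; likely stale
  ));

  // http_build_query() URL-encodes each value and joins the pairs with '&'
  $query = http_build_query(array(
    'source_url' => $source_url,
    'data'       => $data,
    '_'          => $timestamp_ms,
  ));

  return 'http://pinterest.com/resource/' . $resource . '/get/?' . $query;
}

// Usage: the endpoint and option names are the ones that appear in the answer.
$url = build_resource_url(
  'SearchResource',
  '/search/pins/?q=vader',
  array('query' => 'vader', 'scope' => 'pins'),
  1378150472669
);
echo $url . "\n";
```

Letting json_encode() and http_build_query() do the escaping also avoids the whitespace-stripping preg_replace, which would otherwise remove spaces from a multi-word search term.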

Points of interest:

  1. The underscore _ parameter takes a timestamp in JavaScript format, i.e. like Unix time but in milliseconds. It's not actually used for pagination.
  2. Pagination uses the bookmarks property. You make the first request to the 'new' endpoint, which doesn't require it, then take the bookmarks from the result and pass it in your request for the next 'page' of results. Take the bookmarks from those results to fetch the page after that, and so on, until you run out of results, reach your pre-set limit, or hit the server's maximum script execution time. I'd be curious to know exactly what the bookmarks field encodes. I would like to think there's some fun secret sauce beyond just a pin ID or some other page marker.
  3. I'm skipping the HTML and dealing with the JSON instead, as it's easier (for me) than a DOM manipulation solution or a bunch of regex.
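
The bookmark hand-off in point 2 can be sketched as a plain loop. This is an illustration only, written iteratively rather than recursively: collect_pins and js_timestamp are hypothetical names, and $fetch_page stands in for the real HTTP request so the example runs without touching Pinterest.

```php
<?php
// Sketch of bookmark-driven pagination. $fetch_page is a hypothetical callable
// standing in for the real HTTP request: it takes the previous bookmark (null
// on the first call) and returns array('pins' => array, 'bookmark' => string|null).
function collect_pins(callable $fetch_page, $max_pages = 2) {
  $pins = array();
  $bookmark = null;

  for ($page = 1; $page <= $max_pages; $page++) {
    $result = $fetch_page($bookmark);
    if (empty($result['pins'])) break;      // ran out of results

    $pins = array_merge($pins, $result['pins']);
    $bookmark = $result['bookmark'];
    if ($bookmark === null) break;          // no further page advertised
  }
  return $pins;
}

// JavaScript-style timestamp for the "_" parameter: milliseconds since the Unix epoch.
function js_timestamp() {
  return (int) round(microtime(true) * 1000);
}

// Usage with canned pages instead of live requests.
$pages = array(
  ''    => array('pins' => array('a', 'b'), 'bookmark' => 'bm1'),
  'bm1' => array('pins' => array('c'),      'bookmark' => null),
);
$pins = collect_pins(function ($bookmark) use ($pages) {
  return $pages[(string) $bookmark];
}, 5);
print_r($pins);
```

A loop keeps the page limit and the stop conditions in one place, and it returns whatever was collected so far when the results run out.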

<?php

if(!class_exists('Skrivener_Pins')) {

  class Skrivener_Pins {

    /**
     * Constructor
     */
    public function __construct() {
    }

    /**
     * Pinterest search function. Uses Pinterest's "internal" page APIs, so likely to break if they change.
     * @author [@skrivener] Philip Tillsley
     * @param $search_str     The string used to search for matching pins.
     * @param $limit          Max number of pages to get, defaults to 1 to avoid excessively large queries. Use care when passing in a value.
     * @param $bookmarks_str  Used internally for recursive fetches.
     * @param $page           Used internally to limit recursion.
     * @return array()        int['id'], obj['image'], str['pin_link'], str['orig_link'], bool['video_flag']
     * 
     * TODO:
        * 
        * 
     */
    public function get_tagged_pins($search_str, $limit = 1, $bookmarks_str = null, $page = 1) {

      // limit depth of recursion, ie. number of pages of 25 returned, otherwise we can hang on huge queries
      if( $page > $limit ) return false;

      // are we getting a next page of pins or not
      $next_page = false;
      if( isset($bookmarks_str) ) $next_page = true;

      // build url components
      if( !$next_page ) {

        // 1st time
        $search_res = 'BaseSearchResource'; // end point
        $path = '&module_path=' . urlencode('SearchInfoBar(query=' . $search_str . ', scope=boards)');
        $data = preg_replace("'[\n\r\s\t]'","",'{
          "options":{
            "scope":"pins",
            "show_scope_selector":true,
            "query":"' . $search_str . '"
          },
          "context":{
            "app_version":"2f83a7e"
          },
          "module":{
            "name":"SearchPage",
            "options":{
              "scope":"pins",
              "query":"' . $search_str . '"
            }
          },
          "append":false,
          "error_strategy":0
          }');
      } else {

        // this is a fetch for 'scrolling', what changes is the bookmarks reference, 
        // so pass the previous bookmarks value to this function and it is included
        // in query
        $search_res = 'SearchResource'; // different end point from 1st time search
        $path = '';
        $data = preg_replace("'[\n\r\s\t]'","",'{
          "options":{
            "query":"' . $search_str . '",
            "bookmarks":["' . $bookmarks_str . '"],
            "show_scope_selector":null,
            "scope":"pins"
          },
          "context":{
            "app_version":"2f83a7e"
          },
            "module":{
              "name":"GridItems",
            "options":{
              "scrollable":true,
              "show_grid_footer":true,
              "centered":true,
              "reflow_all":true,
              "virtualize":true,
              "item_options":{
                "show_pinner":true,
                "show_pinned_from":false,
                "show_board":true
              },
              "layout":"variable_height"
            }
          },
          "append":true,
          "error_strategy":2
        }');
      }
      $data = urlencode($data);
      $timestamp = time() * 1000; // unix time but in JS format (ie. has ms vs normal server time in secs), * 1000 to add ms (ie. 0ms)

      // build url
      $url = 'http://pinterest.com/resource/' . $search_res . '/get/?source_url=/search/pins/?q=' . $search_str
          . '&data=' . $data
          . $path
          . '&_=' . $timestamp;//'1378150472669';

      // setup curl
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL, $url);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_HTTPHEADER, array("X-Requested-With: XMLHttpRequest"));

      // get result
      $curl_result = curl_exec ($ch); // returns the response as a string (CURLOPT_RETURNTRANSFER is set) rather than echoing it
      $curl_result = json_decode($curl_result);
      curl_close ($ch);

      // clear html to make var_dumps easier to see when debugging
      // $curl_result->module->html = '';

      // isolate the pin data, different end points have different data structures
      if(!$next_page) $pin_array = $curl_result->module->tree->children[1]->children[0]->children[0]->children;
      else $pin_array = $curl_result->module->tree->children;

      // map the pin data into desired format
      $pin_data_array = array();
      $bookmarks = null;
      if(is_array($pin_array)) {
        if(count($pin_array)) {

          foreach ($pin_array as $pin) {

            //setup data
            $image_id = $pin->options->pin_id;
            $image_data = ( isset($pin->data->images->originals) ) ? $pin->data->images->originals : $pin->data->images->orig;
            $pin_url = 'http://pinterest.com/pin/' . $image_id . '/';
            $original_url = $pin->data->link;
            $video = $pin->data->is_video;

            array_push($pin_data_array, array(
              "id"          => $image_id,
              "image"       => $image_data,
              "pin_link"    => $pin_url,
              "orig_link"   => $original_url,
              "video_flag"  => $video,
              ));
          }
          $bookmarks = reset($curl_result->module->tree->resource->options->bookmarks);

        } else {
          $pin_data_array = false;
        }
      }

      // recurse until we're done
      if( !($pin_data_array === false) && !is_null($bookmarks) ) {

        // more pins to get
        $more_pins = $this->get_tagged_pins($search_str, $limit, $bookmarks, ++$page);
        if( !($more_pins === false) ) $pin_data_array = array_merge($pin_data_array, $more_pins);
        return $pin_data_array;
      }

      // end of recursion
      return false;
    }

  } // end class Skrivener_Pins
} // end if



/**
 * Debug/Demo Code
 * delete or comment this section for production
 */

// output headers to control how the content displays
// header("Content-Type: application/json");
header("Content-Type: text/plain");
// header("Content-Type: text/html");

// define search term
// $tag = "vader";
$tag = "haemolytic";
// $tag = "qjkjgjerbjjkrekhjk";

if(class_exists('Skrivener_Pins')) {

  // instantiate the class
  $pin_handler = new Skrivener_Pins();

  // get pins, pinterest returns 25 per batch, function pages through this recursively, pass in limit to 
  // override default limit on number of pages to retrieve, avoid high limits (eg. limit of 20 * 25 pins/page = 500 pins to pull 
  // and 20 separate calls to Pinterest)
  $pins1 = $pin_handler->get_tagged_pins($tag, 2);

  // display the pins for demo purposes
  echo '<h1>Images on Pinterest mentioning "' . $tag . '"</h1>' . "\n";
  if( $pins1 != false ) {
    echo '<p><em>' . count($pins1) . ' images found.</em></p>' . "\n";
    skrivener_dump_images($pins1, 5);
  } else {
    echo '<p><em>No images found.</em></p>' . "\n";
  }
}

// demo function, dumps images in array to html img tags, can pass limit to only display part of array
function skrivener_dump_images($pin_array, $limit = false) {
  if(is_array($pin_array)) {
    if($limit) $pin_array = array_slice($pin_array, -($limit));
    foreach ($pin_array as $pin) {
      echo '<img src="' . $pin['image']->url . '" width="' . $pin['image']->width . '" height="' . $pin['image']->height . '" >' . "\n";
    }
  }
}

?>

Let me know if you run into problems adapting this to your particular endpoints. Apologies for any sloppiness in the code; it never made it to production.
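
Once get_tagged_pins() returns its array, the images still have to be fetched from their URLs, which is what the question ultimately asks for. A rough sketch of that last step, assuming the array shape documented in the class ('id', plus 'image' as an object with a url property); pin_filename and save_pin_images are hypothetical helpers, not part of the original code.

```php
<?php
// Hypothetical helper: picks a local file name for a pin image, using the pin
// id plus the extension found in the image URL (falls back to "jpg").
function pin_filename($pin_id, $image_url) {
  $path = parse_url($image_url, PHP_URL_PATH);
  $ext = $path ? strtolower(pathinfo($path, PATHINFO_EXTENSION)) : '';
  if ($ext === '') $ext = 'jpg';
  return $pin_id . '.' . $ext;
}

// Saves every image in the array returned by get_tagged_pins() into $dir.
// copy() with a URL needs allow_url_fopen enabled; returns the paths written.
function save_pin_images(array $pins, $dir) {
  $saved = array();
  foreach ($pins as $pin) {
    $target = rtrim($dir, '/') . '/' . pin_filename($pin['id'], $pin['image']->url);
    if (@copy($pin['image']->url, $target)) $saved[] = $target;
  }
  return $saved;
}

echo pin_filename('123', 'http://example.com/originals/ab/cd/photo.PNG?x=1') . "\n";
```

Naming files by pin id keeps repeated runs from writing the same image under different names.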
