Identify & Extract the title/description of an Image (Data Scraping Pinterest)


Question

How can JavaScript/jQuery be used to identify the description or title that corresponds to an image on a webpage with multiple images and descriptions?

The page title is easy to extract, but it may not correspond to the image, especially when there are many images on the page:

var title = document.title;

I believe Pinterest's Pin-it bookmarklet does this successfully. I'm guessing it uses an algorithm that finds the nearest h1, h2, or h3, or the image's alt attribute, and falls back to document.title if it fails to identify the image's description on the page.
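The guess above can be sketched as a small function. This is only an illustration of the fallback chain described in the question, not Pinterest's actual bookmarklet logic, and `describeImage` is a hypothetical name. It assumes "nearest heading" means the first h1/h2/h3 found inside the closest ancestor that contains one.

```javascript
// Hypothetical helper: guesses a caption for an <img> element.
// Order of preference: alt attribute, title attribute, nearest heading, page title.
function describeImage(img, doc) {
  var alt = (img.getAttribute('alt') || '').trim();
  if (alt) return alt;

  var title = (img.getAttribute('title') || '').trim();
  if (title) return title;

  // Walk up the ancestors; the first container that holds a heading wins.
  for (var node = img.parentElement; node; node = node.parentElement) {
    var heading = node.querySelector('h1, h2, h3');
    if (heading && heading.textContent.trim()) {
      return heading.textContent.trim();
    }
  }

  return doc.title; // nothing closer was found
}

// In a browser: describeImage(document.images[0], document);
```

Because the function only relies on `getAttribute`, `parentElement`, `querySelector`, and `textContent`, it should work on any standard DOM element.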

Any ideas are much appreciated!

This is for data scraping other websites.

Recommended answer

The OP has provided a great question to expand on. I recently created a jsFiddle for another SO answer to scrape the URL, Title, and Thumbnail from the new Yahoo! Screen video player webpages.

I have just rewritten that jsFiddle so it is Pinterest-specific. It makes direct use of Metatag Object Numbers (more on that later), which makes this jsFiddle very different from that one.

The overall process uses Yahoo's Query Language (YQL) together with the jQuery .ajax() function to fetch the desired data, which is usually available in the metatag section of the webpage's source.
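A rough sketch of what such a request might look like follows. The jsFiddle's real code is not reproduced in this answer, so the function name `buildYqlUrl`, the `//head/meta` XPATH, and the shape of the response are assumptions based on how the public YQL `html` table worked at the time. My understanding is that Yahoo has since retired the public YQL service, so treat this as illustrative only.

```javascript
// Builds a request URL for Yahoo's public YQL endpoint (as it existed at the time).
// pageUrl and xpath are inserted as-is, so they must not contain double quotes.
function buildYqlUrl(pageUrl, xpath) {
  var query = 'select * from html where url="' + pageUrl +
    '" and xpath="' + xpath + '"';
  return 'http://query.yahooapis.com/v1/public/yql?q=' +
    encodeURIComponent(query) + '&format=json';
}

var yqlUrl = buildYqlUrl('http://pinterest.com/pin/40250990391375228/', '//head/meta');

// With jQuery loaded in the page, the call would look roughly like this
// (assumption: the meta objects arrive as an array under query.results.meta):
//
// $.ajax({
//   url: yqlUrl,
//   dataType: 'jsonp',
//   success: function (data) {
//     var metas = data.query.results.meta;
//   }
// });
```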


First, let me explain a few things.

The Pinterest link I will use is a direct link to a pinned item. This means the webpage contains the primary pinned item along with many other smaller pinned items, unlike the homepage, which contains only a multitude of pinned items.

That Pinterest link's webpage title consists of the pinned item's Title plus a few words from the pinned item's Description. This is most likely not what is wanted; the pinned item's Title alone is all that's needed.

Viewing the HTML source of the Pinterest link shows the metatags currently in use. Here are most of them:

<meta property="fb:app_id" content="274266067164"/>

<meta property="og:site_name" content="Pinterest"/>
<meta property="og:type" content="pinterestapp:pin"/>
<meta property="og:url" content="http://pinterest.com/pin/40250990391375228/"/>
<meta property="og:title" content="FUNNY!!"/>
<meta property="og:description" content="Someone please do this."/>
<meta property="og:image" content="http://media-cache0.pinterest.com/upload/62980094758941134_yXgT124O_c.jpg"/>
<meta property="og:see_also" content="http://9gag.com/gag/2934786" />

<meta property="pinterestapp:pinboard" content="http://pinterest.com/amjo32/funny/"/>
<meta property="pinterestapp:pinner" content="http://pinterest.com/amjo32/"/>
<meta property="pinterestapp:source" content="http://9gag.com/gag/2934786"/>
<meta property="pinterestapp:likes" content="21"/>
<meta property="pinterestapp:repins" content="30"/>
<meta property="pinterestapp:comments" content="0"/>
<meta property="pinterestapp:actions" content="51"/>

<meta name="twitter:card" content="photo">
<meta name="twitter:url" content="http://pinterest.com/pin/40250990391375228/">
<meta name="twitter:site" content="@pinterest">

<meta name="google-site-verification" content="NvDayNupl7R0MDceeuRcs7xUf9yqUsxg6WGjEeRdAnc" />
<meta name="application-name" content="Pinterest" />
<meta name="msapplication-TileColor" content="#ffffff" />

As you can see, those metatags contain the og:title and og:image data we are after. These og metatags are therefore the direct target of the data scraping process.
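If the raw HTML of the page is already available as a string, the og values can be looked up by property name. The sketch below is a different technique from the YQL-based script in this answer: it uses a simple regular expression rather than a real HTML parser, assumes double-quoted attributes as in the listing above, and `parseMetaTags` and `ogValue` are hypothetical names.

```javascript
// Collects the attributes of every <meta> tag found in an HTML string.
// Assumption: attribute values are double-quoted, as in the listing above.
function parseMetaTags(html) {
  var tags = [];
  var tagPattern = /<meta\b[^>]*>/gi;
  var tagMatch;
  while ((tagMatch = tagPattern.exec(html)) !== null) {
    var attrs = {};
    var attrPattern = /([\w:-]+)\s*=\s*"([^"]*)"/g;
    var attrMatch;
    while ((attrMatch = attrPattern.exec(tagMatch[0])) !== null) {
      attrs[attrMatch[1].toLowerCase()] = attrMatch[2];
    }
    tags.push(attrs);
  }
  return tags;
}

// Returns the content of the first tag whose property matches, or null.
function ogValue(tags, property) {
  for (var i = 0; i < tags.length; i++) {
    if (tags[i].property === property) return tags[i].content;
  }
  return null;
}

var sampleHead =
  '<meta property="og:title" content="FUNNY!!"/>' +
  '<meta property="og:description" content="Someone please do this."/>' +
  '<meta name="twitter:card" content="photo">';

var sampleTags = parseMetaTags(sampleHead);
var sampleTitle = ogValue(sampleTags, 'og:title');
```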

Note that the og:image content link above is the full-size image, indicated by _c.jpg. Thumbnail versions use _b.jpg. Essentially, there are two image sizes per pinned item.
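Given that naming pattern, a thumbnail URL could be derived from the full-size one with a simple suffix swap. This is a minimal sketch that assumes the `_c.jpg` / `_b.jpg` convention holds for every pin; `toThumbnailUrl` is a hypothetical helper name.

```javascript
// Swaps the full-size suffix (_c.jpg) for the thumbnail suffix (_b.jpg).
// URLs that do not end in _c.jpg are returned unchanged.
function toThumbnailUrl(fullSizeUrl) {
  return fullSizeUrl.replace(/_c\.jpg$/, '_b.jpg');
}

var fullSize = 'http://media-cache0.pinterest.com/upload/62980094758941134_yXgT124O_c.jpg';
var thumbnail = toThumbnailUrl(fullSize);
```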

Since the data scraping process does not return these og property names, only Metatag Object Numbers, we need to analyze the returned content associated with each Metatag Object Number.

Looking at the metatag source above, it's clear that the image will always be located at a URL starting with http://media-. Those 13 characters are unique among all the metatags, so when they match, that entire URL is the image location.
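The prefix match can be sketched as follows. It assumes the scraped metatags arrive as an array of objects that each carry a `content` string; `findImageUrl` and `IMAGE_PREFIX` are hypothetical names, not taken from the jsFiddle.

```javascript
var IMAGE_PREFIX = 'http://media-'; // the 13 characters discussed above

// Returns the first content value that starts with the image prefix, or null.
function findImageUrl(metas) {
  for (var i = 0; i < metas.length; i++) {
    var content = metas[i].content || '';
    if (content.indexOf(IMAGE_PREFIX) === 0) return content;
  }
  return null;
}

var scrapedMetas = [
  { content: 'Pinterest' },
  { content: 'http://pinterest.com/pin/40250990391375228/' },
  { content: 'http://media-cache0.pinterest.com/upload/62980094758941134_yXgT124O_c.jpg' }
];

var imageUrl = findImageUrl(scrapedMetas);
```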

Of course, should Pinterest use more than one URL template for their images, things will need to be adjusted accordingly.

Looking at og:title, you immediately realize that there is no unique string of characters in the content portion to indicate that this tag is the image's title. Therefore, assuming all metatags follow a template that will not change for some time, we will rely on Metatag Object Number 7 to provide the Pinterest pinned item's image title. To be clear, this number 7 is based on the .ajax() and YQL results from this script's process, not on the source HTML structure shown above.
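Reading the title by position might look like the sketch below. Whether "7" is zero-based depends on how the YQL results are indexed, which this answer does not spell out; the sketch simply treats it as a plain array index. `TITLE_INDEX` and `titleFromMetas` are hypothetical names.

```javascript
var TITLE_INDEX = 7; // Metatag Object Number from the answer, used here as an array index

// Returns the content at the expected title position, or null if it is missing.
function titleFromMetas(metas) {
  var entry = metas[TITLE_INDEX];
  return entry && typeof entry.content === 'string' ? entry.content : null;
}

// Example with placeholder entries in front of the title.
var exampleMetas = [];
for (var n = 0; n < TITLE_INDEX; n++) {
  exampleMetas.push({ content: 'placeholder ' + n });
}
exampleMetas.push({ content: 'FUNNY!!' });

var pinTitle = titleFromMetas(exampleMetas);
```

Returning null instead of throwing lets the caller decide whether to fall back to something else, such as the page title.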

Again, if Pinterest changes their template for the head section, adjustments may be required.

What follows is a live step-by-step tutorial I wrote, based on the data scraping techniques/script seen in this online article.


jsFiddle Pinterest Data Scraping Demo


Tip:
Although not demonstrated, a numeric value for the total number of metatags found is at your disposal. It can be checked against a predetermined value for what the page should contain, to detect that the head section has changed. For example, the current metatag count is 25 items. If the returned value is not equal to this on some other Pinterest pinned item webpage, you know a different head section is in use, which may affect the script, since it expects exactly 25 and calls two of them directly by their Metatag Object Number.
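The count check from the tip could be sketched like this. The value 25 comes from the tip itself; `EXPECTED_META_COUNT` and `headMatchesTemplate` are hypothetical names, and the metatags are assumed to arrive as an array.

```javascript
var EXPECTED_META_COUNT = 25; // metatag count reported in the tip above

// True only when the scraped head section has the expected number of metatags.
function headMatchesTemplate(metas) {
  return Array.isArray(metas) && metas.length === EXPECTED_META_COUNT;
}

// Usage idea: skip the positional lookups when the template appears to have changed.
// if (!headMatchesTemplate(metas)) { /* fall back or report the change */ }
```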

Something extra:
If you're curious about how to retrieve the current Pinterest pinned items shown on the homepage, first understand how this jsFiddle DEMO works. Then make your own jsFiddle version for testing, use the Pinterest homepage URL, and change the XPATH in the .ajax() call so that it scrapes only the relevant divs in the body section. To learn more about XPATH basics, click HERE. Then you can understand: XPATH for Select Divs in Body on YQL Playground.

For example, the body section contains a maximum of 50 pins in this format:

 "href": "/pin/15833036160340477/"

Those href fragments serve as a starting point for recreating the URLs. Important note: some pins may be repins, which means fewer than 50 pins may be returned.
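Turning the fragments back into full URLs might look like the sketch below. The base `http://pinterest.com` is taken from the og:url metatag shown earlier. Skipping duplicate fragments is my own assumption about a sensible way to handle repeated entries, and `pinUrlsFromHrefs` is a hypothetical name.

```javascript
var PINTEREST_BASE = 'http://pinterest.com';

// Rebuilds absolute pin URLs from scraped href fragments such as "/pin/123/".
// Fragments that are not pin links, or that were already seen, are skipped.
function pinUrlsFromHrefs(hrefs) {
  var seen = {};
  var urls = [];
  hrefs.forEach(function (href) {
    if (!/^\/pin\/\d+\/$/.test(href) || seen[href]) return;
    seen[href] = true;
    urls.push(PINTEREST_BASE + href);
  });
  return urls;
}

var pinUrls = pinUrlsFromHrefs([
  '/pin/15833036160340477/',
  '/pin/15833036160340477/', // duplicate, skipped
  '/about/'                  // not a pin link, skipped
]);
```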

For those who read this far, here it is:

Extra jsFiddle Demo.

Here is an improved XPATH for Select Divs in Body on YQL Playground, but do understand how the longer one above works.

See also my other Pinterest SO answers:

Custom Pinterest button for a custom URL (text link, image, or both)

How to replicate the Pinterest website's modal effect?
