Identify & Extract the title/description of an Image (Data Scraping Pinterest)

Question

How can JavaScript/jQuery be used to identify the description or title that corresponds to an image on a webpage with multiple images and descriptions?

The page title can be extracted very easily, but it may not correspond to the image, especially if there are many images on the page:

var title = document.title;

I believe Pinterest's Pin-it bookmarklet does this successfully. I'm guessing it uses an algorithm that finds the nearest h1, h2, or h3, or the image's alt attribute, and then falls back to document.title if it cannot identify the image's description on the page.
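The heuristic guessed at above can be sketched as follows. This is only a minimal illustration of that guess, not Pinterest's actual algorithm: `describeImage` is a hypothetical helper, and "nearest heading" is assumed here to mean the closest h1, h2, or h3 that precedes the image or one of its ancestors.

```javascript
// Hypothetical helper (not Pinterest's code): guess a description for an <img>.
// Works on DOM elements, or on any object exposing the same few properties.
function describeImage(img, fallbackTitle) {
  // 1. Prefer the image's own alt or title attribute.
  const own = (img.getAttribute('alt') || img.getAttribute('title') || '').trim();
  if (own) return own;

  // 2. Walk up through the ancestors; at each level scan the preceding
  //    siblings for the closest h1/h2/h3 that has some text.
  for (let node = img; node; node = node.parentElement) {
    for (let sib = node.previousElementSibling; sib; sib = sib.previousElementSibling) {
      if (/^H[1-3]$/.test(sib.tagName)) {
        const text = (sib.textContent || '').trim();
        if (text) return text;
      }
    }
  }

  // 3. Nothing found: fall back to the page title supplied by the caller.
  return fallbackTitle;
}
```

In a browser this could be called as `describeImage(document.images[0], document.title)`.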

Any ideas are much appreciated!

This is for data scraping other websites.

Recommended Answer

The OP has provided a great question to expand on. I recently created a jsFiddle for another SO answer that data scrapes the URL, Title, and Thumbnail from the new Yahoo! Screen Video Player webpages.

I have just rewritten that jsFiddle so it's Pinterest-specific and makes direct use of Metatag Object Numbers (more on that later), which makes this jsFiddle very different from that one.

The overall process uses Yahoo's Query Language (YQL) along with the jQuery .ajax() function to get the desired scraped data, which is usually available in the metatag section of the webpage's source.
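As a rough illustration of the first half of that process, the sketch below only builds the request URL that would be handed to jQuery .ajax(). The endpoint and the `html` table query are assumptions based on YQL's public REST interface as it existed at the time; they are not taken from the jsFiddle, and `buildYqlUrl` is a hypothetical helper.

```javascript
// Assumed public YQL endpoint (not taken from the jsFiddle).
const YQL_ENDPOINT = 'https://query.yahooapis.com/v1/public/yql';

// Hypothetical helper: build a YQL request URL that asks for the nodes
// matching `xpath` on the page at `pageUrl`, returned as JSON.
function buildYqlUrl(pageUrl, xpath) {
  const query = 'select * from html where url="' + pageUrl + '" and xpath="' + xpath + '"';
  return YQL_ENDPOINT + '?q=' + encodeURIComponent(query) + '&format=json';
}
```

For the metatag section, the XPATH would be something like `//meta`, for example `buildYqlUrl('http://pinterest.com/pin/40250990391375228/', '//meta')`.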



First, let me explain a few things.

The Pinterest Link that I will use is a direct link to a pinned item. This means the webpage contains the primary pinned item along with many other smaller pinned items, unlike the homepage, which contains nothing but a multitude of pinned items.

That Pinterest Link has as its Webpage Title the pinned item's Title along with a few words that make up the pinned item's Description. This is most likely not desired; the pinned item's Title alone is all that's needed.

Viewing the HTML source of the Pinterest Link shows us the metatags that are currently used. Here are most of them:

<meta property="fb:app_id" content="274266067164"/>

<meta property="og:site_name" content="Pinterest"/>
<meta property="og:type" content="pinterestapp:pin"/>
<meta property="og:url" content="http://pinterest.com/pin/40250990391375228/"/>
<meta property="og:title" content="FUNNY!!"/>
<meta property="og:description" content="Someone please do this."/>
<meta property="og:image" content="http://media-cache0.pinterest.com/upload/62980094758941134_yXgT124O_c.jpg"/>
<meta property="og:see_also" content="http://9gag.com/gag/2934786" />

<meta property="pinterestapp:pinboard" content="http://pinterest.com/amjo32/funny/"/>
<meta property="pinterestapp:pinner" content="http://pinterest.com/amjo32/"/>
<meta property="pinterestapp:source" content="http://9gag.com/gag/2934786"/>
<meta property="pinterestapp:likes" content="21"/>
<meta property="pinterestapp:repins" content="30"/>
<meta property="pinterestapp:comments" content="0"/>
<meta property="pinterestapp:actions" content="51"/>

<meta name="twitter:card" content="photo">
<meta name="twitter:url" content="http://pinterest.com/pin/40250990391375228/">
<meta name="twitter:site" content="@pinterest">

<meta name="google-site-verification" content="NvDayNupl7R0MDceeuRcs7xUf9yqUsxg6WGjEeRdAnc" />
<meta name="application-name" content="Pinterest" />
<meta name="msapplication-TileColor" content="#ffffff" />

As you can see, those metatags contain the og:title and og:image data we are after. These og metatags are therefore a direct target for the data scraping process.
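To make the target concrete, here is a minimal sketch that pulls the og properties out of a raw HTML string shaped like the listing above. It is not the jsFiddle's code: `extractOgTags` is a hypothetical helper, and the regular expression assumes the exact `property="..." content="..."` attribute order and double quotes seen in that listing, so it is no substitute for a real HTML parser.

```javascript
// Hypothetical helper: collect og:* metatags from an HTML string into an object.
// Assumes each tag looks like <meta property="og:..." content="..."/>.
function extractOgTags(html) {
  const tags = {};
  const metaRe = /<meta\s+property="(og:[^"]+)"\s+content="([^"]*)"\s*\/?>/g;
  let match;
  while ((match = metaRe.exec(html)) !== null) {
    tags[match[1]] = match[2]; // e.g. tags['og:title'] = 'FUNNY!!'
  }
  return tags;
}
```

When the script runs inside the page itself, the same values can be read directly with `document.querySelector('meta[property="og:title"]')` and its `content` attribute.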

To be sure, the og:image content link above is for the full-size version of the image, via _c.jpg. The thumbnail versions use _b.jpg. Essentially, you have two unique image sizes per pinned item.
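Given that naming pattern, a thumbnail URL can be derived from the full-size one with a simple suffix swap. This is a sketch under the assumption that every image URL follows the `_c.jpg` / `_b.jpg` convention described above; `toThumbnailUrl` is a hypothetical helper.

```javascript
// Hypothetical helper: turn a full-size image URL (_c.jpg) into its
// thumbnail counterpart (_b.jpg). URLs without the suffix are returned as-is.
function toThumbnailUrl(fullSizeUrl) {
  return fullSizeUrl.replace(/_c\.jpg$/, '_b.jpg');
}
```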

Since the data scraping process does not return these og property names, only Metatag Object Numbers, we need to analyze the returned content associated with each Metatag Object Number.

Looking at the above metatag source, it's clear that the image is located at a URL starting with http://media-. Those 13 characters are unique among all the metatags, so whichever content value matches them is the image location.

Of course, should Pinterest use more than one URL template for their images, things will need to be adjusted accordingly.

Looking at og:title, you immediately realize that there is no unique string of characters in the content portion to indicate that this tag is the image's title. Therefore, assuming all metatags follow a template that will not change for some time, we will rely on Metatag Object Number 7 to provide the Pinterest Pinned Item's Image Title. To be clear, this number 7 is based on the .ajax() and YQL results from this script's process, not on the source HTML structure as seen above.
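The two lookups just described can be sketched as below. This is not the jsFiddle's code: it assumes the scraped result has already been reduced to a plain array of metatag content strings in the order returned, and `pickTitleAndImage` is a hypothetical helper. The prefix and the index 7 come from the explanation above.

```javascript
const IMAGE_PREFIX = 'http://media-'; // 13 characters, unique to the image URL
const TITLE_INDEX = 7;                // Metatag Object Number used for the title

// Hypothetical helper: `contents` is assumed to be an array of metatag
// content strings, in the order the scrape returned them.
function pickTitleAndImage(contents) {
  // The image is found by content: the first value starting with the prefix.
  const image = contents.find(
    (c) => typeof c === 'string' && c.startsWith(IMAGE_PREFIX)
  ) || null;
  // The title has no distinguishing content, so it is taken by position.
  const title = contents.length > TITLE_INDEX ? contents[TITLE_INDEX] : null;
  return { title: title, image: image };
}
```

Finding the image by content keeps working if the tag order shifts; the title lookup does not, which is why the template assumption matters.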

Again, if Pinterest changes its template for the head section, then adjustments may be required.

What follows now is a live step-by-step tutorial I wrote, based on the data scraping techniques/script seen in this online article: http://ninjagirl.com/posts/2012/02/create-facebook-style-link-preview-using-jquery-yql


jsFiddle Pinterest Data Scraping DEMO


Tip:
Although not demonstrated, you also have at your disposal a numeric value for the total number of metatags found. It can be checked against a predetermined value for what the page should contain, to detect that the head section has changed. For example, the current metatag count is 25 items. If the returned value is not equal to this on any other Pinterest Pinned Item webpage, you know a different head section is in use, which may affect the script, since it expects exactly 25 metatags and calls two of them directly by their Metatag Object Numbers.
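A minimal sketch of that check, assuming the scraped metatags arrive as an array; `headLooksUnchanged` is a hypothetical helper and 25 is the count quoted in the tip.

```javascript
const EXPECTED_META_COUNT = 25; // metatag count quoted in the tip above

// Hypothetical helper: true when the number of scraped metatags matches the
// count the script was written against, false when the head section differs.
function headLooksUnchanged(metaTags) {
  return Array.isArray(metaTags) && metaTags.length === EXPECTED_META_COUNT;
}
```

A script could skip the position-based title lookup, or log a warning, whenever this returns false.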

Something extra:
If you're curious about how to retrieve the current Pinterest Pinned ITEMS as seen on the homepage, first understand how this jsFiddle DEMO works. Then you'll need to make your own jsFiddle version for testing, use the Pinterest Homepage URL, and change the XPATH in the .ajax() call to data scrape only the relevant divs in the body section. Once you have learned the XPATH basics, you will be able to understand: XPATH for Select Divs in Body on YQL Playground.

For example, the body section contains a maximum total of 50 pins in this format:

 "href": "/pin/15833036160340477/"

Those href fragments will serve as a starting point for recreating the URLs. Important note: some pins may be repins, which means you will have fewer than 50 pins returned.
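Recreating the URLs from those fragments might look like the sketch below. It assumes the fragments have been collected into an array of strings and that the site origin is the `http://pinterest.com` seen in the metatags; `buildPinUrls` is a hypothetical helper, and dropping duplicates is an assumption about how repeated pins should be handled.

```javascript
const PINTEREST_ORIGIN = 'http://pinterest.com'; // origin seen in og:url

// Hypothetical helper: turn scraped href fragments such as
// "/pin/15833036160340477/" into absolute pin URLs, skipping anything that
// is not a pin path and dropping duplicates.
function buildPinUrls(hrefs) {
  const pinPath = /^\/pin\/\d+\/$/;
  const unique = new Set(hrefs.filter((h) => pinPath.test(h)));
  return Array.from(unique, (h) => PINTEREST_ORIGIN + h);
}
```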

For those that read this far, here it is:

Extra jsFiddle DEMO

Here is an improved XPATH for Select Divs in Body on YQL Playground, but do understand how the longer one above works.

See also my other Pinterest SO answers:

Custom Pinterest button for custom URL (Text-Link, Image, or Both)

How to replicate the modal effect of the Pinterest website?
