How to find and extract the "main" image in a website


Problem description

I need help tackling a problem. I need a program which, given a site, finds and extracts the "main" picture, i.e. the one that represents the site. (It is sometimes, but not always, the biggest or the first picture.)

How should I approach this? Are there any libraries that could help me with this? Thanks!

Solution

OPTION 1

You could check out Goose. It does something similar to what Pocket and Readability do, i.e. it tries to extract the main article from a given webpage using a set of heuristics. It can apparently also extract the main image from that article, but it is a bit hit-and-miss, so 60% of the time it works every time.

It used to be a Java project but was rewritten in Scala.

From the readme

Goose will try to extract the following information:

  • Main text of an article
  • Main image of article
  • Any Youtube/Vimeo movies embedded in article
  • Meta Description
  • Meta tags
  • Publish Date

Try it here: http://jimplush.com/blog/goose


OPTION 2

You could use a Java wrapper (e.g. GhostDriver) to run a headless browser such as PhantomJS. Then fetch the website and find the img element with the largest dimensions. This GhostDriver test case shows how to query the DOM for elements and get their rendered size.
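The selection step can be sketched in a few lines. This is a Python illustration rather than the Java/GhostDriver route described above, and it assumes each element behaves like a Selenium WebElement, i.e. it exposes a `size` mapping with `width` and `height` and a `get_attribute("src")` method. The function name `largest_rendered_image` is made up for this example.

```python
def largest_rendered_image(elements):
    """Return the src of the element with the largest rendered area, or None.

    Assumption: every element has a `size` dict with "width" and "height"
    (in pixels, as rendered) and a `get_attribute(name)` method, as
    Selenium WebElement objects do.
    """
    best_src, best_area = None, 0
    for element in elements:
        size = element.size
        area = size["width"] * size["height"]
        # Hidden or unloaded images render with zero area and are skipped.
        if area > best_area:
            best_src, best_area = element.get_attribute("src"), area
    return best_src
```

With Selenium's Python bindings, the input would typically come from something like `driver.find_elements(By.TAG_NAME, "img")` after the page has loaded. Comparing rendered area rather than file size has the advantage that CSS scaling is taken into account, but it will still pick a large banner or advertisement if one is present.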


OPTION 3

Use a library like jsoup to parse the HTML. Then read the src attribute of every img tag, request each image URL you find, and measure the image's dimensions. The one with the biggest dimensions is likely to be the website's main image.
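As a rough illustration of this approach, here is a Python sketch that uses only the standard library in place of jsoup. It collects the img URLs, resolves them against the page URL, and picks the image with the largest pixel area. Two assumptions to note: the names `image_urls`, `png_size` and `largest_image` are invented for this example, and the size check only understands PNG headers, so a real implementation would need an image library to handle JPEG, GIF and other formats.

```python
import struct
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collects the src of every <img> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(urljoin(self.base_url, src))


def image_urls(html, base_url):
    """Return absolute URLs for all img tags in the given HTML."""
    collector = ImgSrcCollector(base_url)
    collector.feed(html)
    return collector.sources


def png_size(data):
    """Return (width, height) read from a PNG header, or None if not a PNG."""
    signature = b"\x89PNG\r\n\x1a\n"
    if len(data) < 24 or data[:8] != signature or data[12:16] != b"IHDR":
        return None
    return struct.unpack(">II", data[16:24])


def largest_image(urls, fetch):
    """Return the URL whose image has the largest area, or None.

    `fetch` is a caller-supplied function mapping a URL to its raw bytes
    (an assumption of this sketch, so the logic can run without a network).
    """
    best_url, best_area = None, 0
    for url in urls:
        size = png_size(fetch(url))
        if size and size[0] * size[1] > best_area:
            best_url, best_area = url, size[0] * size[1]
    return best_url
```

In practice `fetch` could wrap `urllib.request.urlopen(url).read()`. Passing it in as a parameter keeps the selection logic separate from the network calls, which also makes it easy to add caching or to skip tiny files such as tracking pixels.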
