How to find and extract the "main" image in a website
Problem description
I need help tackling a problem. I need a program which, given a site, finds and extracts the "main" picture, i.e. the one that represents the site. (Saying it is the biggest or the first picture is sometimes, but not always, true.)
How should I approach this? Are there any libraries that could help me with this? Thanks!
Solution
OPTION 1
You could check out Goose. It does something similar to what Pocket and Readability do, i.e. it tries to extract the main article from a given webpage using a set of heuristics. It can apparently also extract the main image from that article, but it is a bit hit-and-miss, so 60% of the time it works every time.
It used to be a Java project but was rewritten in Scala.
From the readme:
Goose will try to extract the following information:
- Main text of an article
- Main image of article
- Any Youtube/Vimeo movies embedded in article
- Meta Description
- Meta tags
- Publish Date
Try it here: http://jimplush.com/blog/goose
OPTION 2
You could use a Java wrapper (e.g. GhostDriver) for running a headless browser, like PhantomJS. Then, fetch the website and find the img element with the largest dimensions. This GhostDriver test case shows how to query the DOM for elements and get their rendered size.
OPTION 3
Use a library like jsoup that helps you parse HTML. Then get the value of the src attribute from all img tags. Request each image URL you find and measure the image's size. The one with the biggest dimensions is likely to be the website's main image.
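The measuring and selection steps of Option 3 can be sketched with the Java standard library alone. This is only a rough sketch under a few assumptions: the list of absolute image URLs has already been collected (with jsoup, for example, via `doc.select("img[src]")` and `absUrl("src")` on each element), "biggest dimensions" is interpreted as largest pixel area, and the class and method names (`MainImageFinder`, `measure`, `largestByArea`) are hypothetical names chosen for illustration.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of jsoup or any other library.
public class MainImageFinder {

    /** Returns the key whose {width, height} pair has the largest pixel area. */
    public static Optional<String> largestByArea(Map<String, int[]> sizes) {
        String best = null;
        long bestArea = -1;
        for (Map.Entry<String, int[]> e : sizes.entrySet()) {
            long area = (long) e.getValue()[0] * e.getValue()[1];
            if (area > bestArea) {
                bestArea = area;
                best = e.getKey();
            }
        }
        return Optional.ofNullable(best);
    }

    /** Downloads each image and records its dimensions; unreadable ones are skipped. */
    public static Map<String, int[]> measure(List<String> imageUrls) {
        Map<String, int[]> sizes = new LinkedHashMap<>();
        for (String u : imageUrls) {
            try {
                BufferedImage img = ImageIO.read(new URL(u));
                if (img != null) { // null: no registered reader understood the format
                    sizes.put(u, new int[] {img.getWidth(), img.getHeight()});
                }
            } catch (IOException e) {
                // Malformed URL, broken link, or network failure: ignore this candidate.
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        // args: absolute image URLs collected from the page's img tags
        System.out.println(
            largestByArea(measure(List.of(args))).orElse("no readable image found"));
    }
}
```

Downloading every image in full is slow on image-heavy pages, and the largest file is not always the representative one (a wide banner ad can win), so treat the result as a heuristic candidate rather than a definitive answer.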