How can I scrape text and images from a random web page?


Question


I need a way to visually represent a random web page on the internet.

Let's say for example this web page.

Currently, these are the standard assets I can use:

  • Favicon: Too small, too abstract.
  • Title: Very specific but poor visual aesthetics.
  • URL: Nobody cares to read.
  • Icon: Too abstract.
  • Thumbnail: Hard to get, too ugly (many elements crammed in a small space).

I need to visually represent a random website in a way that is very meaningful and inviting for others to click on it.

I need something like what Facebook does when you share a link:

It scrapes the link for images and then creates a beautiful, meaningful tile that is inviting to click on.

Is there any way I can scrape the images and text from websites? I'm primarily interested in an Objective-C/JavaScript combo, but anything will do and will be selected as an approved answer.

Edit: Re-wrote the post and changed the title.

Solution

Websites will often provide meta information for user-friendly social media sharing, such as Open Graph protocol tags. In fact, in your own example, the reddit page has Open Graph tags which make up the information in the link preview (look for meta tags with og: properties).
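As a minimal sketch of that idea, the snippet below pulls `og:` properties out of raw HTML with a regular expression. The function name `parseOpenGraph` is hypothetical, and a production version should use a real HTML parser, since attribute order and quoting vary across sites:

```javascript
// Sketch: collect Open Graph tags (og:title, og:image, ...) from raw HTML.
// Assumes the common attribute order property="..." content="..."; a real
// implementation should parse the DOM instead of relying on a regex.
function parseOpenGraph(html) {
  const tags = {};
  const re = /<meta[^>]+property=["']og:([^"']+)["'][^>]+content=["']([^"']*)["'][^>]*>/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    tags[m[1]] = m[2]; // e.g. tags.title, tags.image
  }
  return tags;
}

// Example usage with a fragment of a page's <head>:
const sample = `
  <meta property="og:title" content="Example Page" />
  <meta property="og:image" content="https://example.com/preview.jpg" />
`;
console.log(parseOpenGraph(sample));
// { title: 'Example Page', image: 'https://example.com/preview.jpg' }
```

From there, building the preview tile is just a matter of rendering `title`, `image`, and `description` (when present) in your own UI.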

A fallback approach would be to implement site-specific parsing code for the most popular websites that don't already conform to a standardized format, or to generically guess what the most prominent content on a given website is (for example, the biggest image above the fold, the first few sentences of the first paragraph, text in heading elements, etc.).
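The generic-guessing fallback can be sketched like this, again with regexes purely for illustration (the function and its heuristics are hypothetical; a real version needs a DOM parser plus layout information such as image dimensions and above-the-fold position):

```javascript
// Naive heuristic fallback: first <h1> as title, first <img> src as
// thumbnail candidate, first <p> as description. This is only a sketch
// of the "guess the prominent content" strategy; it has no notion of
// which image is biggest or what is actually above the fold.
function guessPreview(html) {
  const pick = (re) => (html.match(re) || [])[1] || null;
  return {
    title: pick(/<h1[^>]*>([^<]+)<\/h1>/i),
    image: pick(/<img[^>]+src=["']([^"']+)["']/i),
    description: pick(/<p[^>]*>([^<]+)<\/p>/i),
  };
}
```

Each field comes back as `null` when the page lacks that element, which is exactly the unreliability the paragraph above describes.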

The problem with the former approach is that you have to maintain the parsers as those websites change and evolve; with the latter, you simply cannot reliably predict what's important on a page, and you can't expect to always find what you're looking for either (images for the thumbnail, for example).

Since you will never be able to generate meaningful previews for 100% of websites, it boils down to a simple question: what's an acceptable rate of successful link previews? If it's close to what you can get parsing standard meta information, I'd stick with that and save myself a lot of headache. If not, as an alternative to the libraries shared above, you can also have a look at paid services/APIs, which will likely cover more use cases than you could on your own.

