Instapaper-like algorithm


Question

Does anyone know of an algorithm that extracts content from a webpage, like Instapaper?

Solution

There are two steps to what Instapaper does:

  1. Find the main content block on the page (excluding headers, footers, menus, etc.)
  2. Extract the text from this content block and format it

To find the content block (typically an HTML block element, such as a div containing the key page text), Instapaper uses an algorithm much like the one used by Readability. You can look at the source of readability.js to see what's going on, but at its core it tries to find the area of the page with the highest text-to-link ratio. Some other simple scoring metrics also go into the heuristics (off the top of my head, things like the ratio of text to commas, paragraph elements, etc.).
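To make the scoring idea concrete, here is a minimal sketch in Python using only the standard library. It is not Readability's or Instapaper's actual code: the names `BlockScorer` and `find_main_block`, the set of candidate tags, and the scoring formula are all assumptions made for illustration. Text is credited to the nearest enclosing block element, and a block's score drops as the share of its text inside links grows. It assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Hypothetical choice of candidate containers; real pages may need more.
BLOCK_TAGS = {"div", "article", "section", "td"}
SKIP_TAGS = {"script", "style"}


class BlockScorer(HTMLParser):
    """Collect simple text statistics for each block-level candidate."""

    def __init__(self):
        super().__init__()
        self.stack = []      # block candidates that are currently open
        self.blocks = []     # block candidates that have been closed
        self.link_depth = 0  # > 0 while inside an <a> element
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "attrs": dict(attrs),
                               "text": 0, "link": 0, "commas": 0, "paras": 0})
        elif tag == "a":
            self.link_depth += 1
        elif tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.stack:
            self.stack[-1]["paras"] += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)

    def handle_data(self, data):
        if self.skip_depth or not self.stack:
            return
        block = self.stack[-1]  # credit only the nearest enclosing block
        length = len(" ".join(data.split()))
        block["text"] += length
        block["commas"] += data.count(",")
        if self.link_depth:
            block["link"] += length


def score(block):
    """Illustrative formula: reward text, commas and paragraphs,
    then scale down by the fraction of text that sits inside links."""
    if block["text"] == 0:
        return 0.0
    link_density = block["link"] / block["text"]
    base = block["text"] // 100 + block["commas"] + block["paras"]
    return base * (1.0 - link_density)


def find_main_block(html):
    """Return the highest-scoring block's stats, or None if there is none."""
    scorer = BlockScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.blocks:
        return None
    return max(scorer.blocks, key=score)
```

Calling `find_main_block(page_html)` returns a dictionary whose `attrs` entry (for example its `id` or `class`) identifies the winning element. A navigation bar made entirely of links has a link density of 1 and therefore scores 0, which is the behaviour the text/link ratio is meant to capture.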

Once you have identified the root node element with the relevant content, you'll need to format it. If you want, you can just pull the node element containing the text out of the source document and insert it into yours, but in reality you'll probably want to remove the existing styles and apply your own for a standard look and feel. If you want nice text-only output, you can use Jericho's Renderer.
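As a rough illustration of the text-only route, the sketch below uses Python's standard `html.parser` as a stand-in; it does not use or reproduce Jericho's API. The names `PlainTextRenderer` and `render_text` and the list of tags treated as paragraph breaks are assumptions. Because all markup is discarded, any inline styles and classes from the source page go with it.

```python
from html.parser import HTMLParser

# Hypothetical set of tags that should start a new paragraph in the output.
BREAK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "blockquote", "tr"}
SKIP_TAGS = {"script", "style"}


class PlainTextRenderer(HTMLParser):
    """Keep text nodes, drop all markup, and mark block boundaries."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BREAK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BREAK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def render_text(fragment):
    """Render an HTML fragment as plain text, one blank line per paragraph."""
    renderer = PlainTextRenderer()
    renderer.feed(fragment)
    renderer.close()
    lines = [" ".join(line.split()) for line in "".join(renderer.parts).splitlines()]
    return "\n\n".join(line for line in lines if line)
```

If you want styled HTML output instead, the same traversal could re-emit the tags without their `style` and `class` attributes and leave the look and feel to your own stylesheet.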

update1: I should also mention something else Instapaper does: it follows the 'pagination' links (the "next" or "1", "2", "3" links) of the article to their conclusion, so that a piece that may span many pages in the original is rendered to you as a single document.
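Following pagination could look roughly like the sketch below. How Instapaper actually detects the links is not described here, so the detection rule (an anchor with `rel="next"`, or a label such as "Next"), the names `NextLinkFinder` and `follow_pagination`, and the `fetch` callback are all assumptions. `fetch` is any function that takes a URL and returns that page's HTML, which keeps the example free of network code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical link labels treated as "go to the next page".
NEXT_LABELS = {"next", "next page", "next »"}


class NextLinkFinder(HTMLParser):
    """Remember the href of the first link that looks like a 'next' link."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._href = None   # href of the <a> currently being read
        self._text = []     # text collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        self._href = attrs.get("href")
        self._text = []
        rel = (attrs.get("rel") or "").lower().split()
        if self._href and "next" in rel and self.next_href is None:
            self.next_href = self._href

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip().lower()
            if self.next_href is None and label in NEXT_LABELS:
                self.next_href = self._href
            self._href = None


def follow_pagination(start_url, fetch, max_pages=10):
    """Fetch start_url and every page reachable through 'next' links.

    Stops at the last page, on a loop, or after max_pages pages.
    """
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        pages.append(html)
        finder = NextLinkFinder()
        finder.feed(html)
        finder.close()
        url = urljoin(url, finder.next_href) if finder.next_href else None
    return pages
```

Each returned page can then be run through content extraction separately and the results concatenated into a single document. The `seen` set and `max_pages` limit guard against sites whose "next" links loop back on themselves.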

update2: I recently came across this comparison of text extraction algorithms.
