How to extract article text contents from an HTML page like Pocket (Read It Later) or Readability?

Question

I am looking for an open source framework or algorithm that extracts the article text from any HTML page by cleaning the HTML and removing the junk around it, similar to what the Pocket (aka Read It Later) software does.

Pocket's official site: http://getpocket.com/

This has already been asked in "How to extract text contents from html like Read it later or InstaPaper Iphone app?", but my requirement is a bit different: I want to clean the HTML and extract the main content together with its images, preserving the fonts and styling (CSS).

Recommended answer

I recommend NReadability, together with HtmlAgilityPack.

After NReadability has transcoded the page, the main text is always in the div with id readInner.

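// Replace this with any URL.
string url = "http://www.bbc.co.uk/news/world-asia-19457334";

var transcoder = new NReadability.NReadabilityWebTranscoder();
bool mainContentExtracted;
string page = transcoder.Transcode(url, out mainContentExtracted);

if (mainContentExtracted)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(page);

    // SelectSingleNode returns null when nothing matches, so guard each lookup.
    var titleNode = doc.DocumentNode.SelectSingleNode("//title");
    var imageNode = doc.DocumentNode.SelectSingleNode("//meta[@property='og:image']");
    var mainNode = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']");

    string title = titleNode != null ? titleNode.InnerText : null;
    string imgUrl = imageNode != null ? imageNode.GetAttributeValue("content", null) : null;
    // InnerText drops all markup; use mainNode.InnerHtml instead to keep tags such as <img>.
    string mainText = mainNode != null ? mainNode.InnerText : null;
}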
