How to extract article text contents from an HTML page like Pocket (Read It Later) or Readability?


Question

I am looking for an open source framework or algorithm that extracts the article text from any HTML page by cleaning the HTML code and removing the junk around it, similar to what Pocket (aka Read It Later) does.

Pocket official website: http://getpocket.com/

This question has already been asked here: How to extract text contents from html like Read it later or InstaPaper Iphone app? (http://stackoverflow.com/questions/5960948/how-to-extract-text-contents-from-html-like-read-it-later-or-instapaper-iphone-a), but my requirement is a bit different. I want to clean the HTML and extract the main content together with its images, preserving the fonts and styles (CSS).

Recommended answer

I would suggest NReadability, together with HtmlAgilityPack.

After NReadability has transcoded the page, the main text is always in the div with id readInner.

// Requires the NReadability and HtmlAgilityPack packages.
// Replace this with any URL.
string url = "http://www.bbc.co.uk/news/world-asia-19457334";

var transcoder = new NReadability.NReadabilityWebTranscoder();
bool mainContentExtracted;
string page = transcoder.Transcode(url, out mainContentExtracted);

if (mainContentExtracted)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(page);

    // SelectSingleNode returns null when nothing matches (for example, a page
    // without an og:image meta tag), so each lookup is null-guarded.
    var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
    var imgUrl = doc.DocumentNode.SelectSingleNode("//meta[@property='og:image']")?.GetAttributeValue("content", null);
    var mainText = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']")?.InnerText;
}
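
The question also asks for images and markup to be kept, while InnerText returns plain text only. A minimal sketch of one way to keep them, assuming the transcoded page still contains the readInner div described above: read InnerHtml instead of InnerText, and collect the img elements inside that div. The ArticleExtractor class and its two method names are hypothetical, introduced only for illustration. Whether the original fonts and CSS survive depends on what NReadability leaves in the transcoded page, which this sketch does not address.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of NReadability or HtmlAgilityPack.
public static class ArticleExtractor
{
    // Returns the markup inside the readInner div (tags and images included),
    // or null when the div is not present.
    public static string ExtractMainHtml(string transcodedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(transcodedHtml);
        var main = doc.DocumentNode.SelectSingleNode("//div[@id='readInner']");
        return main?.InnerHtml;
    }

    // Collects the src attribute of every <img> found inside the readInner div.
    public static List<string> ExtractImageUrls(string transcodedHtml)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(transcodedHtml);
        var urls = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var images = doc.DocumentNode.SelectNodes("//div[@id='readInner']//img[@src]");
        if (images != null)
        {
            foreach (var img in images)
            {
                urls.Add(img.GetAttributeValue("src", ""));
            }
        }
        return urls;
    }
}
```

Both methods take the string returned by Transcode, for example ArticleExtractor.ExtractMainHtml(page). Image URLs taken from the page may be relative, so they may need to be resolved against the original page URL before use.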
