Extracting just page text using HTMLAgilityPack


Question

OK, so I am really new to the XPath queries used in HTMLAgilityPack.

So let's consider this page: http://health.yahoo.net/articles/healthcare/what-your-favorite-flavor-says-about-you. What I want is to extract just the page content and nothing else.

To do that, I first remove the script and style tags.

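// Document, TempString and page are assumed to be declared elsewhere; page holds the raw HTML.
Document = new HtmlDocument();
Document.LoadHtml(page);
TempString = new StringBuilder();
// ToArray() (from System.Linq) snapshots the matches so nodes can be removed while iterating.
foreach (HtmlNode style in Document.DocumentNode.Descendants("style").ToArray())
{
    style.Remove();
}
foreach (HtmlNode script in Document.DocumentNode.Descendants("script").ToArray())
{
    script.Remove();
}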

After that, I try to use //text() to get all the text nodes.

foreach (HtmlTextNode node in Document.DocumentNode.SelectNodes("//text()"))
{
    TempString.AppendLine(node.InnerText);
}

However, not only am I getting more than just the text, I am also getting numerous \r and \n characters.

I would appreciate a little guidance on this.

Recommended answer

If you assume that script and style nodes only have text nodes as children, you can use this XPath expression to get the text nodes that are not inside script or style tags, so you don't need to remove those nodes beforehand:

//*[not(self::script or self::style)]/text()

You can further exclude text nodes that contain only whitespace by using XPath's normalize-space():

//*[not(self::script or self::style)]/text()[not(normalize-space(.)="")]

or, shorter:

//*[not(self::script or self::style)]/text()[normalize-space()]
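For illustration, here is a minimal sketch of how the expression above might be plugged into HtmlAgilityPack in place of the removal loops from the question. The class and method names (VisibleText, SelectRaw) are made up for this example and are not part of the library. It returns the text of each matching node as-is, so leading and trailing whitespace is still present at this stage.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class VisibleText
{
    // Text nodes whose parent is not <script> or <style> and that are not whitespace-only.
    private const string Query =
        "//*[not(self::script or self::style)]/text()[normalize-space()]";

    public static List<string> SelectRaw(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null rather than an empty collection when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(Query);
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText); // untrimmed text of the node
        }
        return result;
    }
}
```

Because the query starts from //*, it only sees text that sits inside some element; text placed directly at the top level of a fragment would not be returned.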

But you will still get text nodes that may have leading or trailing whitespace. This can be handled in your application, as @aL3891 suggests.
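One possible way to do that clean-up in application code is sketched below. It is an assumption about what such handling could look like, not the approach from @aL3891's answer, which is not reproduced on this page. The names TextCleanup, Clean and JoinClean are hypothetical. The sketch uses only the .NET base class library: it decodes HTML entities, collapses each run of whitespace (including \r and \n) into a single space, and trims the ends.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class TextCleanup
{
    // Matches one or more consecutive whitespace characters, including \r and \n.
    private static readonly Regex WhitespaceRun = new Regex(@"\s+", RegexOptions.Compiled);

    // Decodes entities such as &amp;, collapses whitespace runs to one space, trims the ends.
    public static string Clean(string raw)
    {
        string decoded = WebUtility.HtmlDecode(raw);
        return WhitespaceRun.Replace(decoded, " ").Trim();
    }

    // Cleans every piece, drops the ones that end up empty, and joins the rest one per line.
    public static string JoinClean(IEnumerable<string> rawTexts)
    {
        return string.Join(
            Environment.NewLine,
            rawTexts.Select(Clean).Where(s => s.Length > 0));
    }
}
```

In the question's loop, this would amount to calling TempString.AppendLine(TextCleanup.Clean(node.InnerText)) instead of appending node.InnerText directly.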
