Extracting just page text using HTMLAgilityPack
Question
OK, so I am really new to the XPath queries used in HTMLAgilityPack.
Consider this page: http://health.yahoo.net/articles/healthcare/what-your-favorite-flavor-says-about-you. What I want is to extract just the page content and nothing else.
To do that, I first remove the script and style tags.
Document = new HtmlDocument();
Document.LoadHtml(page);
TempString = new StringBuilder();
// ToArray() snapshots the matches so nodes can be removed while iterating.
foreach (HtmlNode style in Document.DocumentNode.Descendants("style").ToArray())
{
    style.Remove();
}
foreach (HtmlNode script in Document.DocumentNode.Descendants("script").ToArray())
{
    script.Remove();
}
After that, I try to use //text() to get all the text nodes.
foreach (HtmlTextNode node in Document.DocumentNode.SelectNodes("//text()"))
{
    TempString.AppendLine(node.InnerText);
}
However, not only am I getting more than just the text, I am also getting numerous \r\n characters.
I would appreciate a little guidance on this.
Recommended answer
If you assume that script and style nodes only have text nodes as children, you can use this XPath expression to get the text nodes that are not in script or style tags, so that you don't need to remove those nodes beforehand:
//*[not(self::script or self::style)]/text()
You can further exclude text nodes that contain only whitespace by using XPath's normalize-space():
//*[not(self::script or self::style)]/text()[not(normalize-space(.)="")]
or, more concisely:
//*[not(self::script or self::style)]/text()[normalize-space()]
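As a rough illustration of how the shorter expression might be used from C#, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class name PageText and the method name ExtractTextNodes are made up for this example and do not come from the original post.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PageText
{
    // XPath from the answer: non-blank text nodes whose parent
    // element is neither <script> nor <style>.
    private const string TextXPath =
        "//*[not(self::script or self::style)]/text()[normalize-space()]";

    // Hypothetical helper: returns the raw text of each matching node.
    public static List<string> ExtractTextNodes(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(TextXPath);
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText);
        }
        return result;
    }
}
```

Because the filtering happens inside the XPath expression, the two removal loops from the question should no longer be needed with this approach.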
But you will still get text nodes that may have leading or trailing whitespace. This can be handled in your application, as @aL3891 suggests.
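One possible way to do that application-side cleanup is sketched below. This is an assumption about what such handling could look like, not a reproduction of the other answer; the class name TextCleanup and its methods are hypothetical. It collapses internal runs of whitespace (including \r and \n) to a single space and trims both ends.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TextCleanup
{
    // Hypothetical helper: collapse any run of whitespace to one space,
    // then strip leading and trailing whitespace.
    public static string Clean(string raw)
    {
        return Regex.Replace(raw, @"\s+", " ").Trim();
    }

    // Hypothetical helper: clean each text fragment and join the
    // non-empty results, one per line.
    public static string JoinCleaned(IEnumerable<string> fragments)
    {
        var builder = new StringBuilder();
        foreach (string fragment in fragments)
        {
            string cleaned = Clean(fragment);
            if (cleaned.Length > 0)
            {
                builder.AppendLine(cleaned);
            }
        }
        return builder.ToString();
    }
}
```

In the question's loop, this would amount to appending the cleaned value of node.InnerText rather than the raw value.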