HtmlAgilityPack: xpath and regex
Question
I'm currently using HtmlAgilityPack to search for certain content via an xpath query. Something like this:
    var col = doc.DocumentNode.SelectNodes("//*[text()[contains(., 'foo')] or @*....
Now I want to search for specific content in all of the HTML source code (= text, tags and attributes) using a regular expression. How can this be achieved with HtmlAgilityPack? Can HtmlAgilityPack handle xpath + regex, or what would be the best way of using a regex together with HtmlAgilityPack to search?
Recommended answer
The Html Agility Pack uses the underlying .NET XPATH implementation for its XPATH support. Fortunately XPATH in .NET is fully extensible (BTW: it's a shame Microsoft doesn't invest any more in this superb technology...).
So, let's suppose I have this html:
    <div>hello</div>
    <div>hallo</div>
Here is sample code that will select both nodes, because it matches the nodes' text against the 'h.llo' regex:
    HtmlNodeNavigator nav = new HtmlNodeNavigator("mypage.htm");
    foreach (var node in SelectNodes(nav, "//div[regex-is-match(text(), 'h.llo')]"))
    {
        Console.WriteLine(node.OuterHtml); // should dump both div elements
    }
It works because I use a special Xslt/XPath context where I have defined a new XPATH function called "regex-is-match". Here is the SelectNodes utility code:
    public static IEnumerable<HtmlNode> SelectNodes(HtmlNodeNavigator navigator, string xpath)
    {
        if (navigator == null)
            throw new ArgumentNullException("navigator");

        // Compile the expression and attach the custom context that resolves "regex-is-match".
        XPathExpression expr = navigator.Compile(xpath);
        expr.SetContext(new HtmlXsltContext());

        object eval = navigator.Evaluate(expr);
        XPathNodeIterator it = eval as XPathNodeIterator;
        if (it != null)
        {
            while (it.MoveNext())
            {
                HtmlNodeNavigator n = it.Current as HtmlNodeNavigator;
                if (n != null && n.CurrentNode != null)
                {
                    yield return n.CurrentNode;
                }
            }
        }
    }
Here is the supporting code:
    // Namespaces needed by this code and by the SelectNodes utility above.
    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Xml;
    using System.Xml.XPath;
    using System.Xml.Xsl;
    using HtmlAgilityPack;

    public class HtmlXsltContext : XsltContext
    {
        public HtmlXsltContext()
            : base(new NameTable())
        {
        }

        public override int CompareDocument(string baseUri, string nextbaseUri)
        {
            throw new NotImplementedException();
        }

        public override bool PreserveWhitespace(XPathNavigator node)
        {
            throw new NotImplementedException();
        }

        protected virtual IXsltContextFunction CreateHtmlXsltFunction(string prefix, string name, XPathResultType[] ArgTypes)
        {
            return HtmlXsltFunction.GetBuiltIn(this, prefix, name, ArgTypes);
        }

        // Called by the XPath engine for every function it does not know natively.
        public override IXsltContextFunction ResolveFunction(string prefix, string name, XPathResultType[] ArgTypes)
        {
            return CreateHtmlXsltFunction(prefix, name, ArgTypes);
        }

        public override IXsltContextVariable ResolveVariable(string prefix, string name)
        {
            throw new NotImplementedException();
        }

        public override bool Whitespace
        {
            get { return true; }
        }
    }

    public abstract class HtmlXsltFunction : IXsltContextFunction
    {
        protected HtmlXsltFunction(HtmlXsltContext context, string prefix, string name, XPathResultType[] argTypes)
        {
            Context = context;
            Prefix = prefix;
            Name = name;
            ArgTypes = argTypes;
        }

        public HtmlXsltContext Context { get; private set; }
        public string Prefix { get; private set; }
        public string Name { get; private set; }
        public XPathResultType[] ArgTypes { get; private set; }

        public virtual int Maxargs
        {
            get { return Minargs; }
        }

        public virtual int Minargs
        {
            get { return 1; }
        }

        public virtual XPathResultType ReturnType
        {
            get { return XPathResultType.String; }
        }

        public abstract object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext);

        public static IXsltContextFunction GetBuiltIn(HtmlXsltContext context, string prefix, string name, XPathResultType[] argTypes)
        {
            if (name == "regex-is-match")
                return new RegexIsMatch(context, name);

            // TODO: create other functions here
            return null;
        }

        // Coerces an XPath argument (string, node set, enumerable, ...) into a string.
        // For node sets, the inner (or outer) HTML of each node is concatenated.
        public static string ConvertToString(object argument, bool outer, string separator)
        {
            if (argument == null)
                return null;

            string s = argument as string;
            if (s != null)
                return s;

            XPathNodeIterator it = argument as XPathNodeIterator;
            if (it != null)
            {
                if (!it.MoveNext())
                    return null;

                StringBuilder sb = new StringBuilder();
                do
                {
                    HtmlNodeNavigator n = it.Current as HtmlNodeNavigator;
                    if (n != null && n.CurrentNode != null)
                    {
                        if (sb.Length > 0 && separator != null)
                        {
                            sb.Append(separator);
                        }
                        sb.Append(outer ? n.CurrentNode.OuterHtml : n.CurrentNode.InnerHtml);
                    }
                }
                while (it.MoveNext());
                return sb.ToString();
            }

            IEnumerable enumerable = argument as IEnumerable;
            if (enumerable != null)
            {
                StringBuilder sb = null;
                foreach (object arg in enumerable)
                {
                    if (sb == null)
                    {
                        sb = new StringBuilder();
                    }
                    if (sb.Length > 0 && separator != null)
                    {
                        sb.Append(separator);
                    }
                    string s2 = ConvertToString(arg, outer, separator);
                    if (s2 != null)
                    {
                        sb.Append(s2);
                    }
                }
                return sb != null ? sb.ToString() : null;
            }

            return string.Format("{0}", argument);
        }

        public class RegexIsMatch : HtmlXsltFunction
        {
            public RegexIsMatch(HtmlXsltContext context, string name)
                : base(context, null, name, null)
            {
            }

            public override XPathResultType ReturnType { get { return XPathResultType.Boolean; } }
            public override int Minargs { get { return 2; } }

            public override object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext)
            {
                if (args.Length < 2)
                    return false;

                string input = ConvertToString(args[0], false, null);
                string pattern = ConvertToString(args[1], false, null);

                // ConvertToString returns null for an empty node set (e.g. a div with no text),
                // and Regex.IsMatch throws on null arguments, so treat that case as "no match".
                if (input == null || pattern == null)
                    return false;

                return Regex.IsMatch(input, pattern);
            }
        }
    }
The regex function is implemented in a class called RegexIsMatch at the end. It's not super complicated. Note the utility function ConvertToString, which tries to coerce any XPath "thing" into a string; it is very useful.
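To make the "any XPath thing" part more concrete: as far as I can tell, a custom function's arguments arrive as a string, a double, a bool, or an XPathNodeIterator (for a node set). The following is only a minimal sketch of that coercion using plain System.Xml types, with no Html Agility Pack involved. XPathArgs, ToText and EvaluateToText are hypothetical names introduced for illustration, and unlike ConvertToString this version returns only the string value of the first node rather than concatenated inner HTML.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Xml.XPath;

public static class XPathArgs
{
    // Hypothetical helper: coerce one XPath function argument to a string.
    public static string ToText(object argument)
    {
        if (argument == null)
            return string.Empty;

        XPathNodeIterator it = argument as XPathNodeIterator;
        if (it != null)
        {
            // Work on a clone so the caller's iterator position is left untouched.
            XPathNodeIterator copy = it.Clone();
            // Assumption: use the first node's string value; an empty node set gives "".
            return copy.MoveNext() ? copy.Current.Value : string.Empty;
        }

        // Strings, doubles and booleans: format independently of the current culture.
        return Convert.ToString(argument, CultureInfo.InvariantCulture);
    }

    // Demo entry point: evaluate an XPath expression over an XML string and coerce the result.
    public static string EvaluateToText(string xml, string xpath)
    {
        XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
        return ToText(nav.Evaluate(xpath));
    }
}
```

Returning an empty string instead of null for an empty node set is a design choice in this sketch: it lets the result be passed straight to methods such as Regex.IsMatch, which reject null input.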
Of course, with this technology, you can define whatever XPATH function you need with very little code (I use this all the time to do upper/lower case conversions...).
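As an illustration of such a case-conversion function, here is a rough sketch of a "to-lower" XPath function built with the same XsltContext technique. To keep it self-contained it runs against well-formed XML through XPathDocument instead of the Html Agility Pack; LowerCaseContext, ToLower, LowerCaseDemo and the function name "to-lower" are all made up for this example and are not part of the answer's code or of any library.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical context that resolves exactly one custom function, "to-lower".
public sealed class LowerCaseContext : XsltContext
{
    public LowerCaseContext() : base(new NameTable()) { }

    public override bool Whitespace { get { return true; } }
    public override bool PreserveWhitespace(XPathNavigator node) { return true; }
    public override int CompareDocument(string baseUri, string nextBaseUri) { return 0; }

    public override IXsltContextVariable ResolveVariable(string prefix, string name)
    {
        return null; // this sketch defines no variables
    }

    public override IXsltContextFunction ResolveFunction(string prefix, string name, XPathResultType[] argTypes)
    {
        return name == "to-lower" ? new ToLower() : null;
    }

    private sealed class ToLower : IXsltContextFunction
    {
        public int Minargs { get { return 1; } }
        public int Maxargs { get { return 1; } }
        public XPathResultType ReturnType { get { return XPathResultType.String; } }
        public XPathResultType[] ArgTypes { get { return new[] { XPathResultType.String }; } }

        public object Invoke(XsltContext xsltContext, object[] args, XPathNavigator docContext)
        {
            // A node-set argument arrives as an iterator; take the first node's string value.
            XPathNodeIterator it = args[0] as XPathNodeIterator;
            string text = it != null
                ? (it.MoveNext() ? it.Current.Value : string.Empty)
                : Convert.ToString(args[0], CultureInfo.InvariantCulture);
            return text.ToLowerInvariant();
        }
    }
}

public static class LowerCaseDemo
{
    // Counts the nodes selected by an XPath expression that may call to-lower().
    public static int CountMatches(string xml, string xpath)
    {
        XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
        XPathExpression expr = nav.Compile(xpath);
        expr.SetContext(new LowerCaseContext());
        return nav.Select(expr).Count;
    }
}
```

For example, calling LowerCaseDemo.CountMatches with the XPath //div[to-lower(text()) = 'hello'] should count div elements whose text is "hello" in any casing. With the answer's classes, the equivalent step would presumably be to add another branch in GetBuiltIn and another HtmlXsltFunction subclass.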