Html Agility Pack get all elements by class


Question




I am taking a stab at Html Agility Pack and having trouble finding the right way to go about this.

For example:

var findclasses = _doc.DocumentNode.Descendants("div").Where(d => d.Attributes.Contains("class"));

However, obviously you can add classes to a lot more than divs, so I tried this:

var allLinksWithDivAndClass = _doc.DocumentNode.SelectNodes("//*[@class=\"float\"]");

But that doesn't handle the case where an element has multiple classes and "float" is just one of them, like this:

class="className float anotherclassName"

Is there a way to handle all of this? I basically want to select all nodes that have a class attribute whose value contains the class float.

Answer has been documented on my blog with a full explanation at: Html Agility Pack Get All Elements by Class

Solution

(Updated 2018-03-17)

The problem:

The problem, as you've spotted, is that String.Contains does not perform a word-boundary check, so Contains("float") will return true for both "foo float bar" (correct) and "unfloating" (which is incorrect).

The solution is to ensure that "float" (or whatever your desired class-name is) appears alongside a word-boundary at both ends. In .NET regular expressions a word-boundary is written \b and matches the position between a word character and a non-word character, which covers the start and end of the string, whitespace, and most punctuation. So the regex you want is simply: \bfloat\b.
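To make the difference concrete, here is a minimal sketch comparing String.Contains with the \bfloat\b pattern. One caveat worth knowing: because a hyphen is a non-word character, \bfloat\b also matches a class such as float-left. The WhitespaceDelimited pattern below is not part of the original answer; it is a hypothetical stricter variant (as are the class and method names) for cases where hyphenated class names matter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // The pattern from the text: a word boundary on each side of the class name.
    private static readonly Regex WordBoundary = new Regex(@"\bfloat\b");

    // Hypothetical stricter variant (an assumption, not from the answer):
    // the name must be delimited by whitespace or by the start/end of the value.
    private static readonly Regex WhitespaceDelimited = new Regex(@"(?<!\S)float(?!\S)");

    // Plain substring search: no boundary check at all.
    public static bool ByContains(string classAttr) => classAttr.Contains("float");

    // Word-boundary check: rejects "unfloating" but accepts "float-left".
    public static bool ByWordBoundary(string classAttr) => WordBoundary.IsMatch(classAttr);

    // Whitespace-delimited check: rejects both "unfloating" and "float-left".
    public static bool ByWhitespace(string classAttr) => WhitespaceDelimited.IsMatch(classAttr);
}
```

All three methods accept "foo float bar". Only ByContains accepts "unfloating", and only ByWhitespace rejects "float-left".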

A downside of Regex instances is that they can be slow to run if you don't use the RegexOptions.Compiled option, and slow to construct if you do. So you should cache the Regex instance. This is more difficult if the class-name you're looking for changes at runtime.
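One way to cache regexes for class-names that are only known at runtime is to key them by name in a dictionary. This is a rough sketch rather than anything from the original answer: ClassNameRegexCache and For are hypothetical names, and the cache is unbounded, so it assumes the set of distinct class-names is small.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class ClassNameRegexCache
{
    // Maps a class name to its compiled word-boundary regex.
    // Assumption: the number of distinct class names stays small (no eviction).
    private static readonly ConcurrentDictionary<string, Regex> Cache =
        new ConcurrentDictionary<string, Regex>(StringComparer.Ordinal);

    // Returns the cached regex for the class name, building it on first use.
    public static Regex For(string className)
    {
        return Cache.GetOrAdd(
            className,
            name => new Regex(@"\b" + Regex.Escape(name) + @"\b", RegexOptions.Compiled));
    }
}
```

A caller would then write ClassNameRegexCache.For(className).IsMatch(value) and pay the compilation cost only the first time each name is seen.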

Alternatively, you can search a string for whole words without a regex by implementing the equivalent check as a C# string-processing function, taking care not to cause any new string or other object allocations (e.g. by not using String.Split).

Approach 1: Using a regular-expression:

Suppose you just want to look for elements with a single, design-time specified class-name:

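using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class Program {

    // Compiled once and cached for the lifetime of the program.
    private static readonly Regex _classNameRegex = new Regex( @"\bfloat\b", RegexOptions.Compiled );

    private static IEnumerable<HtmlNode> GetFloatElements(HtmlDocument doc) {
        return doc.DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e => e.Name == "div" && _classNameRegex.IsMatch( e.GetAttributeValue("class", "") ) );
    }
}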

If you need to choose a single class-name at runtime then you can build a regex:

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {

    // Built on every call; cache the instance if this method is called often.
    Regex regex = new Regex( "\\b" + Regex.Escape( className ) + "\\b", RegexOptions.Compiled );

    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e => e.Name == "div" && regex.IsMatch( e.GetAttributeValue("class", "") ) );
}

If you have multiple class-names and you want to match all of them, you could create an array of Regex objects and ensure they're all matching, or combine them into a single Regex using lookarounds, but this results in horrendously complicated expressions - so using a Regex[] is probably better:

using System.Linq;

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String[] classNames) {

    // One word-boundary regex per requested class name.
    Regex[] exprs = new Regex[ classNames.Length ];
    for( Int32 i = 0; i < exprs.Length; i++ ) {
        exprs[i] = new Regex( "\\b" + Regex.Escape( classNames[i] ) + "\\b", RegexOptions.Compiled );
    }

    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            exprs.All( r =>
                r.IsMatch( e.GetAttributeValue("class", "") )
            )
        );
}
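For comparison, this is roughly what the single-Regex alternative with lookarounds looks like: one lookahead per class-name, anchored at the start of the attribute value. It is only a sketch (AllClassesRegex and Build are hypothetical names), and it assumes RegexOptions.Singleline is wanted so that a class attribute containing line breaks is still scanned in full.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AllClassesRegex
{
    // Builds ^(?=.*\bA\b)(?=.*\bB\b)... so that a match requires every name
    // to appear somewhere in the input, in any order.
    public static Regex Build(IEnumerable<string> classNames)
    {
        string lookaheads = string.Concat(
            classNames.Select(n => @"(?=.*\b" + Regex.Escape(n) + @"\b)"));

        // Singleline lets '.' cross line breaks inside the attribute value.
        return new Regex("^" + lookaheads, RegexOptions.Singleline | RegexOptions.Compiled);
    }
}
```

For two names the generated pattern is already ^(?=.*\bfloat\b)(?=.*\bleft\b), which illustrates why the Regex[] version is easier to read and debug.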

Approach 2: Using non-regex string matching:

The advantage of using a custom C# method to do string matching instead of a regex is hypothetically faster performance and reduced memory usage (though Regex may be faster in some circumstances - always profile your code first, kids!)

The method below, CheapClassListContains, provides a fast string-matching function that checks for whitespace boundaries and can be used the same way as regex.IsMatch:

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {

    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            CheapClassListContains(
                e.GetAttributeValue("class", ""),
                className,
                StringComparison.Ordinal
            )
        );
}

/// <summary>Performs optionally-whitespace-padded string search without new string allocations.</summary>
/// <remarks>A regex might also work, but constructing a new regex every time this method is called would be expensive.</remarks>
private static Boolean CheapClassListContains(String haystack, String needle, StringComparison comparison)
{
    if( String.Equals( haystack, needle, comparison ) ) return true;
    Int32 idx = 0;
    while( idx + needle.Length <= haystack.Length )
    {
        idx = haystack.IndexOf( needle, idx, comparison );
        if( idx == -1 ) return false;

        Int32 end = idx + needle.Length;

        // Needle must be enclosed in whitespace or be at the start/end of string
        Boolean validStart = idx == 0               || Char.IsWhiteSpace( haystack[idx - 1] );
        Boolean validEnd   = end == haystack.Length || Char.IsWhiteSpace( haystack[end] );
        if( validStart && validEnd ) return true;

        idx++;
    }
    return false;
}
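If you need the non-regex approach for several class-names at once, one possible shape is to walk the whitespace-separated tokens by index and compare each in place. The sketch below is an illustration under that assumption, not the answer's own code: ClassListScanner, ContainsClass and ContainsAllClasses are hypothetical names, and comparison is always ordinal and case-sensitive.

```csharp
using System;
using System.Collections.Generic;

public static class ClassListScanner
{
    // Returns true if `name` equals one of the whitespace-separated tokens
    // in `classAttr`. No substrings are allocated while scanning.
    public static bool ContainsClass(string classAttr, string name)
    {
        if (string.IsNullOrEmpty(classAttr) || string.IsNullOrEmpty(name)) return false;

        int i = 0;
        while (i < classAttr.Length)
        {
            // Skip any run of whitespace before the next token.
            while (i < classAttr.Length && char.IsWhiteSpace(classAttr[i])) i++;
            int start = i;

            // Advance to the end of the token.
            while (i < classAttr.Length && !char.IsWhiteSpace(classAttr[i])) i++;
            int length = i - start;

            // Compare the token against the name in place.
            if (length == name.Length &&
                string.CompareOrdinal(classAttr, start, name, 0, length) == 0)
            {
                return true;
            }
        }
        return false;
    }

    // Returns true only if every requested name is present as a token.
    public static bool ContainsAllClasses(string classAttr, IReadOnlyList<string> names)
    {
        for (int n = 0; n < names.Count; n++)
        {
            if (!ContainsClass(classAttr, names[n])) return false;
        }
        return true;
    }
}
```

ContainsAllClasses can be dropped into the Where clause in place of CheapClassListContains. It rescans the attribute once per name, which should be fine for the handful of classes a typical element carries, but profile before relying on that.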

Approach 3: Using a CSS Selector library:

HtmlAgilityPack is somewhat stagnant and doesn't support .querySelector and .querySelectorAll, but there are third-party libraries that extend HtmlAgilityPack with them: namely Fizzler and CssSelectors. Both Fizzler and CssSelectors implement QuerySelectorAll, so you can use it like so:

private static IEnumerable<HtmlNode> GetDivElementsWithFloatClass(HtmlDocument doc) {

    return doc.DocumentNode.QuerySelectorAll( "div.float" );
}

With runtime-defined classes:

private static IEnumerable<HtmlNode> GetDivElementsWithClasses(HtmlDocument doc, IEnumerable<String> classNames) {

    String selector = "div." + String.Join( ".", classNames );

    return doc.DocumentNode.QuerySelectorAll( selector );
}
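One thing to watch with the runtime version: if classNames is empty, the concatenation produces "div.", which is not a valid selector. A small helper can guard against that. This is a sketch with hypothetical names (SelectorBuilder, DivWithClasses), and it assumes the class-names contain no characters that would need CSS escaping.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SelectorBuilder
{
    // Builds "div.a.b" from { "a", "b" }, ignoring blank entries.
    // Assumption: names contain no characters that require CSS escaping.
    public static string DivWithClasses(IEnumerable<string> classNames)
    {
        List<string> names = classNames
            .Where(n => !string.IsNullOrWhiteSpace(n))
            .Select(n => n.Trim())
            .ToList();

        // With no usable names, fall back to a plain element selector.
        return names.Count == 0 ? "div" : "div." + string.Join(".", names);
    }
}
```

The resulting string can be passed to QuerySelectorAll in place of the hand-built selector.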
