Why is this HtmlAgilityPack operation invalid when there are, indeed, matching elements?

Problem description

I get "InvalidOperationException > Message=Sequence contains no matching element" with the following code:

private void buttonLoadHTML_Click(object sender, EventArgs e)
{
    GetParagraphsListFromHtml(@"C:\PlatypiRUs\fitt.html");
}

// This code adapted from Kirk Woll's answer at
// http://stackoverflow.com/questions/4752840/html-agility-pack-c-sharp-paragraph-parsing-problem
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
    var pars = new List<string>();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(sourceHtml);
    foreach (var par in doc.DocumentNode
        .DescendantNodes()
        .Single(x => x.Id == "body")
        .DescendantNodes()
        .Where(x => x.Name == "p"))
        // .Where(x => x.Name == "h1" || x.Name == "h2" || x.Name == "h3" || x.Name == "hp" || ))
        // <-- This is what I'd really like to do, but I don't know if this is possible
        //     or, if it is, if the syntax is correct
    {
        pars.Add(par.InnerText);
    }
    // test
    foreach (string s in pars)
    {
        MessageBox.Show(s);
    }
    return pars;
}

Why is the code not finding the paragraphs?

I really want to find all the text (h1..3 or higher vals, too), but this is a start.

BTW: The html file I'm testing with does have some paragraph elements.
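
On the combined filter mentioned in the commented-out `Where` clause: apart from the dangling `||`, that syntax is close to valid. Below is a minimal sketch of one way to collect headings and paragraphs in a single pass, assuming the HtmlAgilityPack NuGet package is referenced. The class name `HeadingAndParagraphText`, the method `Extract`, and the sample markup are hypothetical, introduced only for illustration. Note that `LoadHtml` expects markup, whereas `Load` expects a file path.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HeadingAndParagraphText
{
    // Element names to collect (hypothetical selection); compared case-insensitively.
    private static readonly HashSet<string> Wanted =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "h1", "h2", "h3", "p" };

    // Takes HTML markup (not a file path) and returns the text of each wanted element.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses a markup string; doc.Load(path) would read a file

        return doc.DocumentNode
            .Descendants()                          // every node under the root
            .Where(n => Wanted.Contains(n.Name))    // keep h1, h2, h3 and p elements
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0)               // drop elements with no visible text
            .ToList();
    }

    public static void Main()
    {
        const string sample = "<h1>Title</h1><div>skipped</div><p>First</p><h2>Sub</h2>";
        foreach (var text in Extract(sample))
        {
            Console.WriteLine(text);
        }
    }
}
```

Starting from `DocumentNode` rather than from a `body` lookup means the filter also works on fragments that have no `<body>` tag, such as the test file in this question.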

UPDATE

In response to Amy's implied request, and in the interest of full disclosure/ultimate illumination, here is the entire test html file:

<style>
body {
    background-color: orange;
    font-family: Verdana, sans-serif;
}

h1 {
    color: Blue;   
    font-family: 'Segoe UI', Verdana, sans-serif;
}

h2 {
    color: white;    
    font-family: 'Palatino Linotype', 'Palatino', sans-serif;
}

h3 {
    display: inline-block;
}
</style>

<h1>Found in the Translation</h1>
<h2>Bilingual Editions of Classic Literature</h2>
<div><label>Contact: </label><a href="mailto:axx3andspace@gmail.com">Found in the Translation</a></div>

<h2><cite>Around the World in 80 Days</cite> by Jules Verne (French &amp; English Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495308081" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51BCZUX2-dL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00I0DOYRE" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51BCZUX2-dL._SL160_.jpg" /></a>

<h2><cite>Gulliver's Travels</cite> by Jonathan Swift (English &amp; French Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495374688" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/517O76OyaWL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00I5319ZO" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/517O76OyaWL._SL160_.jpg" /></a>

<h2><cite>Journey to the Center of the Earth</cite> by Jules Verne (French &amp; English Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495409031" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/41hosXOIw8L._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00I6LG25M" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/41qj8DfrihL._SL160_.jpg" /></a>

<h2><cite>Treasure Island</cite> by Robert Louis Stevenson (English &amp; Finnish Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495418936" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51veMV3OiOL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00IA5V4KC" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51XNUWbA07L._SL160_.jpg" /></a>

<h2><cite>Robinson Crusoe</cite> by Daniel Defoe (English &amp; French Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495448053" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51QQMRPrP9L._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00I9IE8OY" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5128hqiw3DL._SL160_.jpg" /></a>

<h2><cite>Don Quixote</cite> by Miguel de Cervantes Saavedra (Spanish &amp; English Side by Side)</h2>
<h3>Paperback</h3></br>
<h3>Volume I</h3>
<a href="http://rads.stackoverflow.com/amzn/click/149474967X" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51HqjOPXLVL._SL160_.jpg" /></a>
<h3>Volume II</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1494803445" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51NONygEMYL._SL160_.jpg" /></a>
<h3>Volume III</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1494841983" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51G%2BW3ICHkL._SL160_.jpg" /></a></br>
<h3>Kindle</h3></br>
<h3>Volume I</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00HQMWPQ2" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51HqjOPXLVL._SL160_.jpg" /></a>
<h3>Volume II</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00HYN2QGM" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51NONygEMYL._SL160_.jpg" /></a>
<h3>Volume III</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00HLX519E" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51G%2BW3ICHkL._SL160_.jpg" /></a></br>

<h2><cite>Alice's Adventures in Wonderland</cite> by Lewis Carroll (English &amp; German Side by Side)</h2>
<h3>Coming soon; for now, see:</h3></br/>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/193659420X" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5143vIpQ2YL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00ESLTIYQ" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51%2BX0Dy7uNL._SL160_.jpg" /></a>

<h2><cite>Alice's Adventures in Wonderland</cite> by Lewis Carroll (English &amp; Italian Side by Side)</h2>
<h3>Coming soon; for now, see:</h3></br/>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/193659420X" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5143vIpQ2YL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00ESLTIYQ" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51%2BX0Dy7uNL._SL160_.jpg" /></a>

<h2>Other Sites:</h2>
<p><a href="http://usamaporama.azurewebsites.net/"  target="_blank">USA Map-O-Rama</a></p>
<p><a href="http://www.awardwinnersonly.com/"  target="_blank">Award-winning Movies, Books, and Music</a></p>
<p><a href="http://www.bigsurgarrapata.com/"  target="_blank">Garrapata State Park in Big Sur Throughout the Seasons</a></p>

UPDATE 2

This works (although it is with "live" web pages, and not html files saved to disk):

public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
    var pars = new List<string>();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(sourceHtml);

    var getHtmlWeb = new HtmlWeb();
    var document = getHtmlWeb.Load("http://www.montereycountyweekly.com/opinion/letters/article_e333a222-942d-11e3-ba9c-001a4bcf6878.html"); 
    //http://www.bigsurgarrapata.com/ only returned one paragraph
    // http://usamaporama.azurewebsites.net/ <-- none
    // http://www.awardwinnersonly.com/ <- same as bigsurgarrapata
    var pTags = document.DocumentNode.SelectNodes("//p");
    int counter = 1;
    if (pTags != null)
    {
        foreach (var pTag in pTags)
        {
            pars.Add(pTag.InnerText);
            MessageBox.Show(pTag.InnerText);
            counter++;
        }
    }
    MessageBox.Show("done!");
    return pars;
}

Solution

It turns out to be pretty easy. The following is not complete, but, inspired by another answer, it is enough to get started:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;

htmlDoc.Load(@"C:\Platypus\dplatypus.htm");

// SelectNodes returns null (not an empty collection) when nothing matches
IEnumerable<HtmlAgilityPack.HtmlNode> textNodes = htmlDoc.DocumentNode.SelectNodes("//text()");
if (textNodes != null)
{
    foreach (HtmlAgilityPack.HtmlNode node in textNodes)
    {
        // Skip text nodes that contain only whitespace
        if (!string.IsNullOrWhiteSpace(node.InnerText))
        {
            MessageBox.Show(node.InnerText);
        }
    }
}
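
As for the original exception, a likely explanation is that `Single` throws `InvalidOperationException` whenever no element satisfies its predicate, and `x.Id == "body"` compares the `id` attribute rather than the tag name. The test file contains neither an `id="body"` attribute nor a `<body>` tag, so nothing can match. The question's code also passes a file path to `LoadHtml`, which parses its argument as markup instead of opening a file. The sketch below shows a more forgiving lookup, assuming the HtmlAgilityPack NuGet package is referenced; `SingleLookupDemo` and `FindSearchRoot` are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class SingleLookupDemo
{
    // Returns the <body> element when the markup has one, otherwise the document root.
    public static HtmlNode FindSearchRoot(HtmlDocument doc)
    {
        // FirstOrDefault yields null instead of throwing when nothing matches.
        // Matching on Name inspects the tag; Id would inspect the id="..." attribute.
        HtmlNode body = doc.DocumentNode
            .Descendants()
            .FirstOrDefault(n => n.Name == "body");

        return body ?? doc.DocumentNode;
    }

    public static void Main()
    {
        var fragment = new HtmlDocument();
        fragment.LoadHtml("<h1>No body tag here</h1><p>Only a fragment</p>");

        // With no <body> present, the search falls back to the document root.
        HtmlNode root = FindSearchRoot(fragment);
        Console.WriteLine(ReferenceEquals(root, fragment.DocumentNode));
    }
}
```

`Single` is still a reasonable choice when exactly one match is a hard requirement; the fallback form simply suits files that may be fragments.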
