Why is this HtmlAgilityPack operation invalid when there are, indeed, matching elements?
I get "InvalidOperationException > Message=Sequence contains no matching element" with the following code:
private void buttonLoadHTML_Click(object sender, EventArgs e)
{
    GetParagraphsListFromHtml(@"C:\PlatypiRUs\fitt.html");
}

// This code is adapted from Kirk Woll's answer at
// http://stackoverflow.com/questions/4752840/html-agility-pack-c-sharp-paragraph-parsing-problem
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
    var pars = new List<string>();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(sourceHtml);
    foreach (var par in doc.DocumentNode
        .DescendantNodes()
        .Single(x => x.Id == "body")
        .DescendantNodes()
        .Where(x => x.Name == "p"))
        // .Where(x => x.Name == "h1" || x.Name == "h2" || x.Name == "h3" || x.Name == "hp" || ))
        // <-- This is what I'd really like to do, but I don't know whether it is
        // possible or, if it is, whether the syntax is correct.
    {
        pars.Add(par.InnerText);
    }
    // test
    foreach (string s in pars)
    {
        MessageBox.Show(s);
    }
    return pars;
}
Why is the code not finding the paragraphs?
I really want to find all the text (h1 through h3, and higher heading levels, too), but this is a start.
BTW: The html file I'm testing with does have some paragraph elements.
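For what it's worth, two things in the code above look like they could produce this exception. `LoadHtml` parses its argument as markup, so passing a file path gives a document whose only content is the path text. And `Single` throws "Sequence contains no matching element" whenever no node has the id `body`. The multi-tag filter from the commented-out line can be written as a set lookup. This is a minimal sketch, not a definitive fix; the class name, method name, and tag list are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HeadingAndParagraphFilter
{
    // Element names to keep (illustrative list); HtmlAgilityPack exposes
    // element names in lower case through HtmlNode.Name.
    private static readonly HashSet<string> WantedTags =
        new HashSet<string> { "p", "h1", "h2", "h3" };

    // Hypothetical helper: expects HTML markup, not a file path.
    public static List<string> GetTextBlocks(string htmlMarkup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlMarkup);

        return doc.DocumentNode
            .Descendants()                                  // every node in the tree
            .Where(node => WantedTags.Contains(node.Name))  // keep only the wanted tags
            .Select(node => node.InnerText.Trim())
            .Where(text => text.Length > 0)                 // drop empty blocks
            .ToList();
    }
}
```

To run it against a file on disk, either pass `File.ReadAllText(path)` as the markup or swap `LoadHtml` for `doc.Load(path)`. Because nothing here calls `Single`, a missing element simply yields an empty list.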
UPDATE
In response to Amy's implied request, and in the interest of full disclosure/ultimate illumination, here is the entire test html file:
<style>
body {
background-color: orange;
font-family: Verdana, sans-serif;
}
h1 {
color: Blue;
font-family: 'Segoe UI', Verdana, sans-serif;
}
h2 {
color: white;
font-family: 'Palatino Linotype', 'Palatino', sans-serif;
}
h3 {
display: inline-block;
}
</style>
<h1>Found in the Translation</h1>
<h2>Bilingual Editions of Classic Literature</h2>
<div><label>Contact: </label><a href="mailto:axx3andspace@gmail.com">Found in the Translation</a></div>
<h2><cite>Around the World in 80 Days</cite> by Jules Verne (French & English Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495308081" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51BCZUX2-dL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00I0DOYRE" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51BCZUX2-dL._SL160_.jpg" /></a>
<h2><cite>Gulliver's Travels</cite> by Jonathan Swift (English & French Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495374688" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/517O76OyaWL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00I5319ZO" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/517O76OyaWL._SL160_.jpg" /></a>
<h2><cite>Journey to the Center of the Earth</cite> by Jules Verne (French & English Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495409031" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/41hosXOIw8L._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00I6LG25M" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/41qj8DfrihL._SL160_.jpg" /></a>
<h2><cite>Treasure Island</cite> by Robert Louis Stevenson (English & Finnish Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495418936" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51veMV3OiOL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00IA5V4KC" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51XNUWbA07L._SL160_.jpg" /></a>
<h2><cite>Robinson Crusoe</cite> by Daniel Defoe (English & French Side by Side)</h2>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1495448053" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51QQMRPrP9L._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00I9IE8OY" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5128hqiw3DL._SL160_.jpg" /></a>
<h2><cite>Don Quixote</cite> by Miguel de Cervantes Saavedra (Spanish & English Side by Side)</h2>
<h3>Paperback</h3></br>
<h3>Volume I</h3>
<a href="http://rads.stackoverflow.com/amzn/click/149474967X" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51HqjOPXLVL._SL160_.jpg" /></a>
<h3>Volume II</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1494803445" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51NONygEMYL._SL160_.jpg" /></a>
<h3>Volume III</h3>
<a href="http://rads.stackoverflow.com/amzn/click/1494841983" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51G%2BW3ICHkL._SL160_.jpg" /></a></br>
<h3>Kindle</h3></br>
<h3>Volume I</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00HQMWPQ2" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51HqjOPXLVL._SL160_.jpg" /></a>
<h3>Volume II</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00HYN2QGM" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51NONygEMYL._SL160_.jpg" /></a>
<h3>Volume III</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00HLX519E" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51G%2BW3ICHkL._SL160_.jpg" /></a></br>
<h2><cite>Alice's Adventures in Wonderland</cite> by Lewis Carroll (English & German Side by Side)</h2>
<h3>Coming soon; for now, see:</h3></br/>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/193659420X" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5143vIpQ2YL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00ESLTIYQ" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51%2BX0Dy7uNL._SL160_.jpg" /></a>
<h2><cite>Alice's Adventures in Wonderland</cite> by Lewis Carroll (English & Italian Side by Side)</h2>
<h3>Coming soon; for now, see:</h3></br/>
<h3>Paperback</h3>
<a href="http://rads.stackoverflow.com/amzn/click/193659420X" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5143vIpQ2YL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="http://rads.stackoverflow.com/amzn/click/B00ESLTIYQ" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51%2BX0Dy7uNL._SL160_.jpg" /></a>
<h2>Other Sites:</h2>
<p><a href="http://usamaporama.azurewebsites.net/" target="_blank">USA Map-O-Rama</a></p>
<p><a href="http://www.awardwinnersonly.com/" target="_blank">Award-winning Movies, Books, and Music</a></p>
<p><a href="http://www.bigsurgarrapata.com/" target="_blank">Garrapata State Park in Big Sur Throughout the Seasons</a></p>
UPDATE 2
This works (although it is with "live" web pages, and not html files saved to disk):
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
    var pars = new List<string>();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(sourceHtml);
    var getHtmlWeb = new HtmlWeb();
    var document = getHtmlWeb.Load("http://www.montereycountyweekly.com/opinion/letters/article_e333a222-942d-11e3-ba9c-001a4bcf6878.html");
    // http://www.bigsurgarrapata.com/ only returned one paragraph
    // http://usamaporama.azurewebsites.net/ <-- none
    // http://www.awardwinnersonly.com/ <-- same as bigsurgarrapata
    var pTags = document.DocumentNode.SelectNodes("//p");
    int counter = 1;
    if (pTags != null)
    {
        foreach (var pTag in pTags)
        {
            pars.Add(pTag.InnerText);
            MessageBox.Show(pTag.InnerText);
            counter++;
        }
    }
    MessageBox.Show("done!");
    return pars;
}
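Note that the method above ignores its `sourceHtml` argument and always fetches the hard-coded URL. A rough sketch of how the same `//p` query might serve both live pages and files saved to disk follows. The class name, method name, and the scheme check are my own assumptions, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParagraphReader
{
    // Hypothetical helper: 'source' is either an http(s) URL or a local file path.
    public static List<string> GetParagraphs(string source)
    {
        HtmlDocument document;
        if (Uri.TryCreate(source, UriKind.Absolute, out Uri uri) &&
            (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
        {
            // Live page: let HtmlWeb download and parse it.
            document = new HtmlWeb().Load(source);
        }
        else
        {
            // Anything else is treated as a path to a file on disk.
            document = new HtmlDocument();
            document.Load(source);
        }

        var pars = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection pTags = document.DocumentNode.SelectNodes("//p");
        if (pTags == null)
        {
            return pars;
        }
        foreach (HtmlNode pTag in pTags)
        {
            pars.Add(pTag.InnerText);
        }
        return pars;
    }
}
```

The message boxes are left out so the caller decides what to do with the list, for example `GetParagraphs(@"C:\PlatypiRUs\fitt.html")`.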
It turns out to be pretty easy. The following is not complete, but, inspired by another answer, it is enough to get started:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options; set them as needed
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(@"C:\Platypus\dplatypus.htm");
if (htmlDoc.DocumentNode != null)
{
    // SelectNodes returns null (not an empty collection) when nothing matches
    IEnumerable<HtmlAgilityPack.HtmlNode> textNodes = htmlDoc.DocumentNode.SelectNodes("//text()");
    if (textNodes != null)
    {
        foreach (HtmlAgilityPack.HtmlNode node in textNodes)
        {
            if (!string.IsNullOrWhiteSpace(node.InnerText))
            {
                MessageBox.Show(node.InnerText);
            }
        }
    }
}
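One caveat with `//text()`: depending on how the parser exposes them, the contents of `<style>` or `<script>` elements may come back as text nodes too (an assumption worth checking against your own file). A small sketch that collects the text into a list and skips those parents could look like this; the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class VisibleTextCollector
{
    // Hypothetical helper: returns the trimmed, non-empty text nodes of a loaded
    // document, skipping text whose direct parent is a <style> or <script> element.
    public static List<string> GetVisibleText(HtmlDocument htmlDoc)
    {
        var result = new List<string>();
        HtmlNodeCollection textNodes = htmlDoc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
        {
            return result; // no text nodes at all
        }

        foreach (HtmlNode node in textNodes)
        {
            string parentName = node.ParentNode?.Name;
            if (parentName == "style" || parentName == "script")
            {
                continue; // CSS or JavaScript source, not page text
            }

            string text = node.InnerText.Trim();
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }
        return result;
    }
}
```

Returning a list rather than showing message boxes keeps the extraction separate from the UI, so the same method can back the button handler or a test.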