Parsing HTML with HtmlAgilityPack and loading it into a DataTable in C#


Problem description


I have HTML that looks like this:

<body class="style_0">
        <div>
            <div class="style_1">Pending Test List</div>
            <table style=" width: 100%;" id="AUTOGENBOOKMARK_4365445353431356880">
                <col>
                <col>
                <tbody>
                    <tr>
                        <td style="vertical-align: baseline;">
                            <div class="style_4">Pending Test List</div>
                        </td>
                        <td style="vertical-align: baseline;">
                            <div class="style_5">SOME AGENCY Laboratories, Inc.</div>
                        </td>
                    </tr>
                </tbody>
            </table>
            <table class="style_6" style=" width: 4.531in;" id="AUTOGENBOOKMARK_5083738604442918131">
                <col style=" width: 1in;">
                <col class="style_7" style=" width: 0.75in;">
                <col class="style_8" style=" width: 0.6in;">
                <col style=" width: 0.75in;">
                <col style=" width: 2.375in;">
                <tbody>
                    <tr class="style_9" style=" height: 0.5in;">
                        <td style="vertical-align: middle;">
                            <div class="style_10">Report Range:</div>
                        </td>
                        <td style="vertical-align: middle;">
                            <div class="style_11">01/01/2012</div>
                        </td>
                        <td style="vertical-align: middle;">
                            <div class="style_12">through</div>
                        </td>
                        <td style="vertical-align: middle;">
                            <div class="style_13">01/31/2012</div>
                        </td>
                        <td style="vertical-align: middle;">
                            <div class="style_14">(by Date Entered)</div>
                        </td>
                    </tr>
                </tbody>
            </table>
            <table class="style_15" style=" width: 100%;" id="AUTOGENBOOKMARK_7602283385844673591" iid="/526

(QuRs78576248:0)">
                <col style=" width: 0.75in;">
                <col style=" width: 1.25in;">
                <col style=" width: 1in;">
                <col style=" width: 1.5in;">
                <col style=" width: 1.5in;">
                <col style=" width: 1.5in;">
                <col>
                <thead>
                    <tr>
                        <td colspan="4" style="vertical-align: baseline;"></td>
                        <td style="vertical-align: baseline;"></td>
                        <td style="vertical-align: baseline;"></td>
                        <td style="vertical-align: baseline;"></td>
                    </tr>
                    <tr>
                        <td style="vertical-align: baseline;">
                            <div class="style_16">Entered</div>
                        </td>
                        <td style="vertical-align: baseline;">
                            <div class="style_16">Spec. ID</div>
                        </td>
                        <td style="vertical-align: baseline;">
                            <div class="style_16">Batch/Pos.</div>
                        </td>
                        <td style="vertical-align: baseline;">
                            <div class="style_16">Test</div>
                        </td>
                        <td style="vertical-align: baseline;">
                            <div class="style_16">Client ID</div>
                        </td>
                        <td style="vertical-align: baseline;">
                            <div class="style_16">Client Name</div>
                        </td>
                        <td style="vertical-align: baseline;">
                            <div class="style_16">Agency</div>
                        </td>
                    </tr>
                </thead>
                <tbody>
                    <tr>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_18">1/30/12</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_19">ZZ324sdf</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_18">51446 / 75</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">HOLD_DE</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">234234</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">smith, john</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">PPPM-6P - SOME AGENCY</div>
                        </td>
                    </tr>
                    <tr>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_18">1/31/12</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_19">SFD3434</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_18">51668 / 17</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">HOLD_DE</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">FOY, EL</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">FOY, ALEX</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">someagency &amp; Associates LLC</div>
                        </td>
                    </tr>
                    <tr>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_18">1/31/12</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_19">SFD3434</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_18">51668 / 25</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">HOLD_DE</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">JAMISON, PA</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">JAMISON, ROY</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">someagency &amp; Associates LLC</div>
                        </td>
                    </tr>
                    <tr>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_18">1/31/12</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_19">SFD3434</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_18">51669 / 34</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">HOLD_DE</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">NEWMAN, SO</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">NEWMAN, ALEX</div>
                        </td>
                        <td class="style_17" style="vertical-align: baseline;">
                            <div class="style_20">someagency &amp; Associates LLC</div>
                        </td>
                    </tr>
                </tbody>
                <tfoot>
                    <tr>
                        <td colspan="2" style="vertical-align: baseline;">
                            <div class="style_21">Total Tests:</div>
                        </td>
                        <td style="vertical-align: baseline;">
                            <div class="style_22">4</div>
                        </td>
                        <td style="vertical-align: baseline;"></td>
                        <td style="vertical-align: baseline;"></td>
                        <td style="vertical-align: baseline;"></td>
                        <td style="vertical-align: baseline;"></td>
                    </tr>
                </tfoot>
            </table>
            <table style=" width: 100%;" id="AUTOGENBOOKMARK_8507236727661888074">
                <col>
                <col>
                <col>
                <tbody>
                    <tr>
                        <td style="vertical-align: baseline;">
                            <div class="style_2">
                                <br>Feb 13, 2012 9:37 AM</div>
                        </td>
                        <td style="vertical-align: baseline;">
                            <div class="style_3">
                                <br>
                                <div style="text-align:center;">Page 1</div>
                            </div>
                        </td>
                        <td style="vertical-align: baseline;"></td>
                    </tr>
                </tbody>
            </table>
        </div>
    </body>

When rendered, it looks something like this:

[screenshot of the rendered report not preserved]

Here is the data that I want to parse out of it:

1/30/12 ZZ324sdf 51446 / 75 HOLD_DE 234234 smith, john PPPM-6P - SOME AGENCY
1/31/12 SFD3434 51668 / 17 HOLD_DE FOY, EL FOY, ALEX someagency & Associates LLC
1/31/12 SFD3434 51668 / 25 HOLD_DE JAMISON, PA JAMISON, ROY someagency & Associates LLC
1/31/12 SFD3434 51669 / 34 HOLD_DE NEWMAN, SO NEWMAN, ALEX someagency & Associates LLC

So far I have tried:

foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    hrefTags.Add(att.Value);
}

But I understand this will extract only the href values of the anchor elements, and I want to extract the table elements.

How do I do this? Thank you so much for your help.

Solution

Think of it slightly differently: instead of wanting every anchor (with an href), you want every row from the body of the table with class style_15 (the id looks like it is generated on the fly, so the class is the safer thing to select on). Then, for every row, you want every cell.

foreach (var row in htmlSnippet.DocumentNode.SelectNodes("//table[@class = 'style_15']/tbody/tr"))
{
    foreach (var cell in row.SelectNodes("td"))
    {
        // Do something
    }
}
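
To fill in the "Do something" part for the DataTable case from the title, here is a minimal sketch. It is not part of the original answer. The class name ReportTableLoader and the methods LoadTable and Clean are made up for illustration, and taking column names from the last row of the thead is an assumption based on the HTML in the question. It relies on SelectNodes returning null when nothing matches, and on HtmlEntity.DeEntitize to turn &amp; back into &.

```csharp
using System.Data;
using HtmlAgilityPack;

// Hypothetical helper, named for illustration only.
public static class ReportTableLoader
{
    public static DataTable LoadTable(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = new DataTable();

        // Assumption: the last header row holds the column captions.
        // Duplicate captions would make Columns.Add throw.
        var headerCells = doc.DocumentNode.SelectNodes(
            "//table[@class='style_15']/thead/tr[last()]/td");
        if (headerCells != null)
        {
            foreach (var cell in headerCells)
                table.Columns.Add(Clean(cell.InnerText));
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes(
            "//table[@class='style_15']/tbody/tr");
        if (rows == null)
            return table;

        foreach (var row in rows)
        {
            var cells = row.SelectNodes("td");
            if (cells == null)
                continue;

            // Add default-named columns if a row has more cells than headers.
            while (table.Columns.Count < cells.Count)
                table.Columns.Add();

            var values = new object[cells.Count];
            for (int i = 0; i < cells.Count; i++)
                values[i] = Clean(cells[i].InnerText);

            table.Rows.Add(values);
        }

        return table;
    }

    // Decode entities such as &amp; and strip the surrounding whitespace.
    private static string Clean(string text)
    {
        return HtmlEntity.DeEntitize(text).Trim();
    }
}
```

With the HTML from the question in a string, `ReportTableLoader.LoadTable(html)` should give one DataRow per `tr` in the `tbody`, with every value stored as a string. Converting the date column to DateTime would be a separate step.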
