Loop through multiple HTML tables in HTML Agility Pack


Problem description


I followed the example at the link below and was able to parse a single HTML table into a DataTable successfully.

http://blog.ditran.net/parsing-html-table-to-c-usable-datalist/

However, I cannot parse multiple tables. In each table, the first TR always holds the column names and the remaining TRs hold the data. My current logic relies on that: it stores the table data in a list of key/value pairs (theDict in the code below) and passes it to my ToDataTable function.

Can someone show me how to loop through multiple tables and apply the same logic to each one? I'd appreciate it.

var tRowList = doc.DocumentNode.SelectNodes("//tr");
foreach (HtmlNode tRow in tRowList)
{
    if (previousRowSpanList.Count > 0)
    {
        theDict = previousRowSpanList[0];
        previousRowSpanList.Remove(theDict); // remove it from the list
        isWorkingWithRowSpan = true;
    }
    else
    {
        theDict = new List<KeyValuePair<string, string>>();
        isWorkingWithRowSpan = false;
    }
    var tCellList = tRow.SelectNodes("td|th");
    tCelCount = tCellList.Count;
    if (tCelCount > 0 &&
        !(tCelCount == 1 && string.IsNullOrEmpty(tCellList[0].InnerText.Trim())))
    {
        //colOrder = 1;
        IsNullEntireRow = true;
        for (int colIndex = 0; colIndex < tCelCount; colIndex++)
        {
            cell = tCellList[colIndex];
            ColInnerText = cell.InnerText.Replace("&nbsp;", " ").Trim();
            if (!string.IsNullOrEmpty(ColInnerText))
                IsNullEntireRow = false;

            // (the rest of the loop is omitted in the original post)

static DataTable ToDataTable(List<List<KeyValuePair<string, string>>> list)
{
    DataTable result = new DataTable();
    if (list.Count == 0)
        return result;

    // The first row supplies the column names.
    result.Columns.AddRange(
        list.First().Select(r => new DataColumn(r.Value)).ToArray()
    );

    // Every remaining row becomes a data row.
    list = list.Skip(1).ToList();
    list.ForEach(r => result.Rows.Add(r.Select(c => c.Value).Cast<object>().ToArray()));

    return result;
}

Sample HTML:

<table>
<tbody>
<tr><td style="background-color:#A9F5A9;font-weight:bold;" class="center">Node</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">Logtime</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">Hardware</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">Prcstate A</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">Prcstate B</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">Cluster</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">RAID</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">AD replication A</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">AD replication B</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">File replication A</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">File replication B</td><td style="background-color:#A9F5A9;font-weight:bold;" class="center">hcstart RESULT</td></tr>
<tr><td class="center">DTMSCB1</td><td class="center">2016-08-26 16:40</td><td class="center">APG43L</td><td class="center">active</td><td class="center">passive</td><td class="center">-</td><td class="center">-</td><td class="center">-</td><td class="center">-</td><td class="center">-</td><td class="center">-</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">Not OK</td></tr>
<tr><td class="center">MSC9</td><td class="center">2016-08-26 16:40</td><td class="center">APG40C/4</td><td class="center">passive</td><td class="center">active</td><td class="center">OK</td><td class="center">OK</td><td class="center">OK</td><td class="center">OK</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">Not OK</td><td class="center">OK</td><td class="center">-</td></tr>
</tbody>
</table>


<table>
<tbody>
<tr><td style="background-color:#A9F5A9;" class="center">Node Type</td><td style="background-color:#A9F5A9;" class="center">Node</td><td style="background-color:#A9F5A9;" class="center">Log Time</td><td style="background-color:#A9F5A9;" class="center">New Mon. Alarms</td><td style="background-color:#A9F5A9;" class="center">Mon. Alarms Total</td><td style="background-color:#A9F5A9;" class="center">Other Alarms</td><td style="background-color:#A9F5A9;" class="center">MML</td></tr>
<tr><td class="center">BSC</td><td class="center">BMBSC1</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">46</td><td class="center">445</td><td class="center">OK</td></tr>
<tr><td class="center">BSC</td><td class="center">BMBSC2C</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">27</td><td class="center">609</td><td class="center">OK</td></tr>
<tr><td class="center">BSC</td><td class="center">CYBSC1</td><td class="center">2016-08-26 16:45</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">1</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">45</td><td class="center">665</td><td class="center">OK</td></tr>
<tr><td class="center">BSC</td><td class="center">CYBSC2C</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">30</td><td class="center">849</td><td class="center">OK</td></tr>
<tr><td class="center">MSC-BC</td><td class="center">CYMSCB1</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">38</td><td class="center">283</td><td class="center">OK</td></tr>
<tr><td class="center">BSC</td><td class="center">DTBSC1</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">48</td><td class="center">201</td><td class="center">OK</td></tr>
<tr><td class="center">BSC</td><td class="center">DTBSC2</td><td class="center">2016-08-26 16:45</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">1</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">31</td><td class="center">310</td><td class="center">OK</td></tr>
<tr><td class="center">MSC-BC</td><td class="center">DTMSCB1</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">25</td><td class="center">130</td><td class="center">OK</td></tr>
<tr><td class="center">HLR</td><td class="center">HLR1</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">16</td><td class="center">12</td><td class="center">OK</td></tr>
<tr><td class="center">HLR</td><td class="center">HLR2</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">24</td><td class="center">10</td><td class="center">OK</td></tr>
<tr><td class="center">MSC-S</td><td class="center">MSC10</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">48</td><td class="center">79</td><td class="center">OK</td></tr>
<tr><td class="center">MSC-S</td><td class="center">MSC9</td><td class="center">2016-08-26 16:45</td><td class="center">0</td><td style="background-color:#FF0000;color:#FFFFFF;font-weight:bold;" class="center">46</td><td class="center">131</td><td class="center">OK</td></tr>
</tbody>
</table>

Solution

I'll keep the first answer for reference, but below is a method that splits the original HTML into a string array, with each element containing the HTML for one table:

public static string[] ParseHtmlSplitTables(string htmlString)
{
    string[] result = new string[] { };

    if (!String.IsNullOrWhiteSpace(htmlString))
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        var tableNodes = doc.DocumentNode.SelectNodes("//table");
        if (tableNodes != null)
        {
            result = Array.ConvertAll<HtmlNode, string>(tableNodes.ToArray(), n => n.OuterHtml);
        }
    }

    return result;
}

With the result you can then proceed to parse each table:

string[] htmlTables = ParseHtmlSplitTables(htmlString);

foreach (string html in htmlTables)
{
    List<List<KeyValuePair<string, string>>> parseResult = ParseHtmlToDataTable(html);

    DataTable dataTable = ToDataTable(parseResult);
}
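
As a possible alternative to splitting the document into strings and re-parsing each one, the table nodes can also be walked directly. The following is only a minimal sketch, not the method from the answer above: the class name TableLoopSketch and the method ParseTables are hypothetical, it ignores rowspan/colspan handling entirely, and it assumes that each table's first row holds unique column names and that no data row has more cells than the header row.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using HtmlAgilityPack;

public static class TableLoopSketch
{
    // Hypothetical helper (not part of the original answer):
    // returns one DataTable per <table> element found in the HTML.
    public static List<DataTable> ParseTables(string htmlString)
    {
        var tables = new List<DataTable>();
        if (string.IsNullOrWhiteSpace(htmlString))
            return tables;

        var doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var tableNodes = doc.DocumentNode.SelectNodes("//table");
        if (tableNodes == null)
            return tables;

        foreach (HtmlNode tableNode in tableNodes)
        {
            // The leading "." keeps the query relative to this table;
            // a bare "//tr" would search the whole document again.
            var rowNodes = tableNode.SelectNodes(".//tr");
            if (rowNodes == null)
                continue;

            var dataTable = new DataTable();
            bool isHeaderRow = true;

            foreach (HtmlNode rowNode in rowNodes)
            {
                var cellNodes = rowNode.SelectNodes("td|th");
                if (cellNodes == null)
                    continue; // row without cells

                // Decode entities such as &nbsp; and trim surrounding whitespace.
                string[] values = cellNodes
                    .Select(c => HtmlEntity.DeEntitize(c.InnerText).Trim())
                    .ToArray();

                if (isHeaderRow)
                {
                    // Assumption: header names are unique within a table;
                    // duplicate names make Columns.Add throw.
                    foreach (string name in values)
                        dataTable.Columns.Add(name);
                    isHeaderRow = false;
                }
                else
                {
                    // Assumption: no data row has more cells than the header row.
                    dataTable.Rows.Add(values.Cast<object>().ToArray());
                }
            }

            tables.Add(dataTable);
        }

        return tables;
    }
}
```

Calling TableLoopSketch.ParseTables(htmlString) on the sample HTML should yield two DataTable instances, one per table. Note that ".//tr" also matches rows of any nested tables, so this sketch only suits flat, non-nested tables like the ones in the question.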
