How to parse HTML table using jsoup?

Question

I am trying to parse HTML using jsoup. This is my first time working with jsoup, and I have read some tutorials on it as well. Below is the HTML table I am trying to parse.

The table below has three tr rows for now (I shortened it to three rows to keep the example simple; in general there will be more). I would like to extract the Cluster Name from the table together with its corresponding host names. For example, I would extract Titan as the cluster name, along with all of its host names whose status is down.

As you can see below, the Titan cluster has two host names, machineA.abc.com and machineB.abc.com: the status of machineA is up, but the status of machineB is down.

So I would print Titan as the cluster name and machineB.abc.com as the host name, since it is down. Is this possible using jsoup?

<table border=1>
   <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Alert</td>
      <td>Cluster Name</td>
      <td>IP addr</td>
      <td>Host Name</td>
      <td>Type</td>
      <td>Status</td>
      <td>Free</td>
      <td>Version</td>
      <td>Restart Time</td>
      <td>UpTime(Days)</td>
      <td>Last probed</td>
      <td>Last up</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Titan</td>
      <td>10.100.111.77</td>
      <td>machineA.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td></td>
      <td>10.200.192.99</td>
      <td>machineB.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">down</td>
      <td bgcolor="ffffff" align=right>85%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
      <td bgcolor="ffffff" align=right>103</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
</table>

So far, I am able to extract the whole HTML table using jsoup, but I am not sure how to extract the cluster name and the host names that are down:

URL url = new URL("url_name");
Document doc = Jsoup.parse(url, 3000);

Update:

I might have two cluster names in the table, as shown below:

<table border=1>
   <tr>
      <td>&nbsp;</td>
      <td>&nbsp;</td>
      <td>Alert</td>
      <td>Cluster Name</td>
      <td>IP addr</td>
      <td>Host Name</td>
      <td>Type</td>
      <td>Status</td>
      <td>Free</td>
      <td>Version</td>
      <td>Restart Time</td>
      <td>UpTime(Days)</td>
      <td>Last probed</td>
      <td>Last up</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Titan</td>
      <td>10.100.111.77</td>
      <td>machineA.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td></td>
      <td>10.200.192.99</td>
      <td>machineB.abc.com</td>
      <td></td>
      <td bgcolor="ffffff">down</td>
      <td bgcolor="ffffff" align=right>85%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
      <td bgcolor="ffffff" align=right>103</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>
   <tr bgcolor="ffffff">
      <td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
      <td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
      <td bgcolor="ffffff">&nbsp</td>
      <td>Goldy</td>
      <td>10.100.111.77</td>
      <td>machineH.pqr.com</td>
      <td></td>
      <td bgcolor="ffffff">up</td>
      <td bgcolor="ffffff" align=right>88%</td>
      <td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
      <td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
      <td bgcolor="ffffff" align=right>381</td>
      <td>07-14 20:01:59</td>
      <td>07-14 20:01:59</td>
   </tr>       
</table>

As you can see above, there are now two cluster names, Titan and Goldy. I want to find all the machines that are down for the Titan cluster only.

Solution

Yes, this is possible with jsoup. First, select the table, then select its <tr> tags to get the rows. Start from index 1, since the first row (index 0) contains only the column names. For each row, select the <td> cells and read the ones you need by index. In your case, indexes 5 and 7 are the important ones (index 5: Host Name, index 7: Status). If the status equals down, add the host name to a list. That's all.

ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); //select the first table.
Elements rows = table.select("tr");

for (int i = 1; i < rows.size(); i++) { //first row is the col names so skip it.
    Element row = rows.get(i);
    Elements cols = row.select("td");

    if (cols.get(7).text().equals("down")) {
        downServers.add(cols.get(5).text());
    }
}
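
The hard-coded indexes 5 and 7 in the loop above only hold while the column order stays the same. As a rough sketch of one way to avoid that, the positions can be looked up from the header row by name. This is plain Java with no jsoup calls: it assumes each row has already been turned into a list of cell texts (for example from row.select("td") and text()), and DownHostFilter, columnIndex and downHosts are hypothetical names made up for this illustration. The demo table is cut down to three columns to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup. It works on cell texts that were
// already extracted from each table row.
class DownHostFilter {

    // Returns the position of the header cell whose text matches name, or -1 if absent.
    static int columnIndex(List<String> header, String name) {
        for (int i = 0; i < header.size(); i++) {
            if (header.get(i).trim().equalsIgnoreCase(name)) {
                return i;
            }
        }
        return -1;
    }

    // Collects the host name of every data row whose status is "down".
    // rows.get(0) is assumed to be the header row.
    static List<String> downHosts(List<List<String>> rows) {
        List<String> result = new ArrayList<>();
        if (rows.isEmpty()) {
            return result;
        }
        int hostCol = columnIndex(rows.get(0), "Host Name");
        int statusCol = columnIndex(rows.get(0), "Status");
        if (hostCol < 0 || statusCol < 0) {
            throw new IllegalArgumentException("Header row lacks Host Name or Status");
        }
        for (List<String> row : rows.subList(1, rows.size())) {
            if (row.size() <= Math.max(hostCol, statusCol)) {
                continue; // skip rows that are too short to hold both cells
            }
            if (row.get(statusCol).trim().equalsIgnoreCase("down")) {
                result.add(row.get(hostCol).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("Cluster Name", "Host Name", "Status"),
                List.of("Titan", "machineA.abc.com", "up"),
                List.of("", "machineB.abc.com", "down"));
        System.out.println(downHosts(rows)); // prints [machineB.abc.com]
    }
}
```

With the full table from the question, columnIndex would return 5 for Host Name and 7 for Status, matching the indexes used in the loop above.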

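Update: To restrict the result to Titan, look for a row whose Cluster Name is Titan, then use an inner loop to walk the following rows for as long as their Cluster Name cell is empty, since those rows belong to the same cluster.

Edit: I changed the while loop to a do-while loop.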

    ArrayList<String> downServers = new ArrayList<>();
    Element table = doc.select("table").get(0); // select the first table
    Elements rows = table.select("tr");

    for (int i = 1; i < rows.size(); i++) { // the first row holds the column names, so skip it
        Element row = rows.get(i);
        Elements cols = row.select("td");

        if (cols.get(3).text().equals("Titan")) {
            if (cols.get(7).text().equals("down"))
                downServers.add(cols.get(5).text());

            // Rows that follow with an empty Cluster Name cell belong to the same cluster.
            boolean sameCluster = false;
            do {
                if (i == rows.size() - 1)
                    break; // no rows left to look at
                cols = rows.get(i + 1).select("td"); // peek at the next row
                sameCluster = cols.get(3).text().isEmpty();
                if (sameCluster) {
                    i++; // consume the continuation row
                    if (cols.get(7).text().equals("down"))
                        downServers.add(cols.get(5).text());
                }
            } while (sameCluster);
        }
    }

The downServers ArrayList will contain the host names of the servers that are down.
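
As an alternative to the nested loop, a single pass that remembers the last non-empty Cluster Name may be easier to follow. The sketch below is only an illustration under assumptions: it is plain Java with no jsoup calls, it expects the data rows (header already skipped) as lists of cell texts, and ClusterDownHosts and downHostsFor are hypothetical names. The demo uses a reduced three-column layout, so the column positions are 0, 1 and 2; for the full table in the question they would be 3, 5 and 7.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup. Each row is a list of cell texts.
class ClusterDownHosts {

    // Returns the host names of rows that belong to the given cluster and are "down".
    // A row with an empty cluster cell is treated as belonging to the most recent
    // cluster name seen above it.
    static List<String> downHostsFor(String cluster, List<List<String>> dataRows,
                                     int clusterCol, int hostCol, int statusCol) {
        List<String> result = new ArrayList<>();
        String current = ""; // cluster name carried forward from earlier rows
        for (List<String> row : dataRows) {
            String name = row.get(clusterCol).trim();
            if (!name.isEmpty()) {
                current = name; // a new cluster starts on this row
            }
            if (current.equals(cluster) && row.get(statusCol).trim().equals("down")) {
                result.add(row.get(hostCol).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Reduced copy of the rows from the question: cluster, host, status.
        List<List<String>> dataRows = List.of(
                List.of("Titan", "machineA.abc.com", "up"),
                List.of("", "machineB.abc.com", "down"),
                List.of("Goldy", "machineH.pqr.com", "up"));
        System.out.println(downHostsFor("Titan", dataRows, 0, 1, 2)); // prints [machineB.abc.com]
    }
}
```

Because the loop index is never adjusted by hand, there is no special handling for the last row or for two named cluster rows in a row.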
