How to parse an HTML table using jsoup?
Question
I am trying to parse HTML using jsoup. This is my first time working with jsoup, and I have read some tutorials on it as well. Below is the HTML table I am trying to parse.
The table below has three tr rows for now (I have shortened it to three rows to keep it easy to follow; in general there will be more). I would like to extract the Cluster Name from the table along with its corresponding Host Name. For example, I would extract Titan as the cluster name, together with all of its hostnames whose status is down.
As you can see below, the Titan cluster has two hostnames, machineA.abc.com and machineB.abc.com. The status of machineA is up, but the status of machineB is down.
So I would print Titan as the cluster name and machineB.abc.com as the hostname, since it is down. Is this possible using jsoup?
<table border=1>
<tr>
<td> </td>
<td> </td>
<td>Alert</td>
<td>Cluster Name</td>
<td>IP addr</td>
<td>Host Name</td>
<td>Type</td>
<td>Status</td>
<td>Free</td>
<td>Version</td>
<td>Restart Time</td>
<td>UpTime(Days)</td>
<td>Last probed</td>
<td>Last up</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Titan</td>
<td>10.100.111.77</td>
<td>machineA.abc.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td></td>
<td>10.200.192.99</td>
<td>machineB.abc.com</td>
<td></td>
<td bgcolor="ffffff">down</td>
<td bgcolor="ffffff" align=right>85%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
<td bgcolor="ffffff" align=right>103</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
</table>
So far I am able to fetch the whole HTML table using jsoup, but I am not sure how to extract the cluster name and the hostnames that are down:
URL url = new URL("url_name");
Document doc = Jsoup.parse(url, 3000);
Update:
I might have two cluster names in the table, as shown below:
<table border=1>
<tr>
<td> </td>
<td> </td>
<td>Alert</td>
<td>Cluster Name</td>
<td>IP addr</td>
<td>Host Name</td>
<td>Type</td>
<td>Status</td>
<td>Free</td>
<td>Version</td>
<td>Restart Time</td>
<td>UpTime(Days)</td>
<td>Last probed</td>
<td>Last up</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Titan</td>
<td>10.100.111.77</td>
<td>machineA.abc.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td></td>
<td>10.200.192.99</td>
<td>machineB.abc.com</td>
<td></td>
<td bgcolor="ffffff">down</td>
<td bgcolor="ffffff" align=right>85%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
<td bgcolor="ffffff" align=right>103</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Goldy</td>
<td>10.100.111.77</td>
<td>machineH.pqr.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
</table>
As you can see above, there are now two cluster names, Titan and Goldy. I want to find all the machines that are down for the Titan cluster only.
Solution
Yes, this is possible with jsoup. First, select the table. Then select the <tr> tags to get the rows. You can start from the second row (index 1), since the first row contains only the column names. For each row, select its <td> cells and read the ones at the indexes you need. In your case, indexes 7 and 5 matter (index 7: Status, index 5: Host Name). If the status equals down, add the host name to a list. That's all.
ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); // select the first table
Elements rows = table.select("tr");
for (int i = 1; i < rows.size(); i++) { // the first row holds the column names, so skip it
    Element row = rows.get(i);
    Elements cols = row.select("td");
    if (cols.get(7).text().equals("down")) {
        downServers.add(cols.get(5).text());
    }
}
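The hard-coded indexes 7 and 5 break as soon as a column is added or moved. As a rough sketch of one way around that, the column positions can be looked up from the header row instead. To keep the example runnable without any dependency, it works on plain lists of strings: each inner list is assumed to hold the text of every <td> in one <tr>, in order (the same values that cols.get(n).text() returns above). The class name DownHostsByHeader and the method downHosts are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DownHostsByHeader {

    // rows.get(0) is assumed to be the header row; every other entry is one data row.
    static List<String> downHosts(List<List<String>> rows) {
        List<String> down = new ArrayList<>();
        if (rows.isEmpty()) {
            return down;
        }
        List<String> header = rows.get(0);
        int statusCol = header.indexOf("Status");
        int hostCol = header.indexOf("Host Name");
        if (statusCol < 0 || hostCol < 0) {
            throw new IllegalArgumentException("Expected 'Status' and 'Host Name' header cells");
        }
        for (List<String> cells : rows.subList(1, rows.size())) {
            // Skip short or malformed rows instead of failing with an index error.
            if (cells.size() <= Math.max(statusCol, hostCol)) {
                continue;
            }
            if ("down".equals(cells.get(statusCol).trim())) {
                down.add(cells.get(hostCol).trim());
            }
        }
        return down;
    }

    public static void main(String[] args) {
        // Cell texts copied from the first eight columns of the question's table.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("", "", "Alert", "Cluster Name", "IP addr", "Host Name", "Type", "Status"),
            Arrays.asList("Hist", "VI", "", "Titan", "10.100.111.77", "machineA.abc.com", "", "up"),
            Arrays.asList("Hist", "VI", "", "", "10.200.192.99", "machineB.abc.com", "", "down"));
        System.out.println(downHosts(rows)); // prints [machineB.abc.com]
    }
}
```

With jsoup, the inner lists would be filled by looping over row.select("td") and collecting text() for each cell, including the empty ones so that the positions stay aligned with the header.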
Update:
When you find the word Titan, you can start another loop and check whether the cluster name of the following rows is empty.
Edit: I changed the while loop to a do-while loop.
ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); // select the first table
Elements rows = table.select("tr");
for (int i = 1; i < rows.size(); i++) { // the first row holds the column names, so skip it
    Element row = rows.get(i);
    Elements cols = row.select("td");
    if (cols.get(3).text().equals("Titan")) {
        if (cols.get(7).text().equals("down")) {
            downServers.add(cols.get(5).text());
        }
        // Look ahead only when rows remain; without this guard a Titan row
        // in the last position would be revisited forever.
        if (i < rows.size() - 1) {
            do {
                i++;
                row = rows.get(i);
                cols = row.select("td");
                // A row with an empty cluster cell belongs to the preceding Titan row.
                if (cols.get(7).text().equals("down") && cols.get(3).text().equals("")) {
                    downServers.add(cols.get(5).text());
                }
                if (i == rows.size() - 1) {
                    break;
                }
            } while (cols.get(3).text().equals(""));
            i--; // step back so the outer loop re-checks this row (it may be another Titan row)
        }
    }
}
The downServers list will then contain the hostnames of the servers that are down.
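The nested loop and index adjustment can be avoided by remembering the most recent non-empty cluster name while walking the rows once, so that a row with an empty Cluster Name cell inherits the cluster of the row above it. The following is only a sketch of that alternative, not the code from the answer. It assumes the column positions from the question's table (3: Cluster Name, 5: Host Name, 7: Status) and takes the cell texts of the data rows as plain lists so it runs without jsoup. ClusterDownHosts and downHosts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClusterDownHosts {

    // Column positions taken from the table in the question.
    static final int CLUSTER_COL = 3;
    static final int HOST_COL = 5;
    static final int STATUS_COL = 7;

    // dataRows holds the cell texts of every row except the header row.
    static List<String> downHosts(List<List<String>> dataRows, String cluster) {
        List<String> down = new ArrayList<>();
        String currentCluster = "";
        for (List<String> cells : dataRows) {
            if (cells.size() <= STATUS_COL) {
                continue; // malformed row, nothing to read
            }
            String name = cells.get(CLUSTER_COL).trim();
            if (!name.isEmpty()) {
                currentCluster = name; // a new cluster starts on this row
            }
            // Rows with an empty cluster cell keep the cluster of the row above.
            if (currentCluster.equals(cluster) && "down".equals(cells.get(STATUS_COL).trim())) {
                down.add(cells.get(HOST_COL).trim());
            }
        }
        return down;
    }

    public static void main(String[] args) {
        // First eight cells of the three data rows from the updated table in the question.
        List<List<String>> dataRows = Arrays.asList(
            Arrays.asList("Hist", "VI", "", "Titan", "10.100.111.77", "machineA.abc.com", "", "up"),
            Arrays.asList("Hist", "VI", "", "", "10.200.192.99", "machineB.abc.com", "", "down"),
            Arrays.asList("Hist", "VI", "", "Goldy", "10.100.111.77", "machineH.pqr.com", "", "up"));
        System.out.println(downHosts(dataRows, "Titan")); // prints [machineB.abc.com]
    }
}
```

Because the cluster name is carried forward rather than searched for, the same pass also handles two Titan rows in a row and a Titan row at the very end of the table.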