Extract Data out of table with JSoup


Question

I want to extract this table with the jsoup framework and save its content in a "table" array. The first tr tag is the table header. All following rows (not all included here) hold the content.

<table style=h2 width=100% cellspacing="0" cellpadding="4" border="1" bgColor="#FFFFFF">
<tr>
<td align="left" bgcolor="#9999FF" >
<!-- 0 -->
Kl.
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 3 -->
Std.
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 4 -->
Lehrer
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 5 -->
Fach
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 6 -->
Raum
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 7 -->
VLehrer
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 8 -->
VFach
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 9 -->
VRaum
</td>
<td align="left" bgcolor="#9999FF" >
<!-- 13 -->
Info
</td>
</tr>
<tr>
<!-- 1 0 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
<!-- 1 3 -->
<td align="left" bgcolor="#FFFFFF" >
4
</td>
<!-- 1 4 -->
<td align="left" bgcolor="#FFFFFF" >
Méta
</td>
<!-- 1 5 -->
<td align="left" bgcolor="#FFFFFF" >
HU
</td>
<!-- 1 6 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
<!-- 1 7 -->
<td align="left" bgcolor="#FFFFFF" >
Shne
</td>
<!-- 1 8 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
<!-- 1 9 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
<!-- 1 13 -->
<td align="left" bgcolor="#FFFFFF" >
&nbsp;
</td>
</tr>

I have already tried this one and a few others, but I could not get them to work for me:
Using JSoup To Extract HTML Table Contents

Recommended answer

Here is some example code showing how you can select only the header:

Element tableHeader = doc.select("tr").first();


for( Element element : tableHeader.children() )
{
    // Here you can do something with each element
    System.out.println(element.text());
}

You get the document by either:

  1. parsing a file: Document doc = Jsoup.parse(f, null); (where f is the File and null the charset; see the jsoup documentation for more information)

  2. parsing a website: Document doc = Jsoup.connect("http://your.url.here").get(); (don't leave out the http://)

Output:

Kl.
Std.
Lehrer
Fach
Raum
VLehrer
VFach
VRaum
Info


Now, if you need an array (or better, a List) of all entries, you can create a new class that stores all the information of one entry. Then parse the HTML with jsoup, fill all fields of the class, and add each instance to the list.

// Note: all values are strings here. You'll want better types (int, enum, whatever), but for an example it's enough.
public class Entry
{
    private String klasse;
    private String stunde;
    private String lehrer;
    private String fach;
    private String raum;
    private String vLehrer;
    private String vFach;
    private String vRaum;
    private String info;


    // constructor(s) and getter / setter

    /*
     * Btw. it's a good idea using two constructors here: one with all arguments and one empty. So you can create a new instance without knowing any data and add it with setter-methods afterwards.
     */
}
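
The class above leaves its constructors, setters and toString() to the reader. A minimal sketch of how they might be filled in follows. The setter names match the ones called in the parsing loop of this answer; the toString() format is an assumption modelled on the sample output, not something the answer spells out. The class is package-private here only so the snippet compiles in any file.

```java
// Sketch of Entry with the members the answer leaves out (assumed, not the author's code).
class Entry {
    private String klasse, stunde, lehrer, fach, raum, vLehrer, vFach, vRaum, info;

    // Empty constructor: create the instance first, fill it with setters afterwards.
    Entry() { }

    // All-arguments constructor for when every value is already known.
    Entry(String klasse, String stunde, String lehrer, String fach, String raum,
          String vLehrer, String vFach, String vRaum, String info) {
        this.klasse = klasse;
        this.stunde = stunde;
        this.lehrer = lehrer;
        this.fach = fach;
        this.raum = raum;
        this.vLehrer = vLehrer;
        this.vFach = vFach;
        this.vRaum = vRaum;
        this.info = info;
    }

    public void setKlasse(String klasse)   { this.klasse = klasse; }
    public void setStunde(String stunde)   { this.stunde = stunde; }
    public void setLehrer(String lehrer)   { this.lehrer = lehrer; }
    public void setFach(String fach)       { this.fach = fach; }
    public void setRaum(String raum)       { this.raum = raum; }
    public void setvLehrer(String vLehrer) { this.vLehrer = vLehrer; }
    public void setvFach(String vFach)     { this.vFach = vFach; }
    public void setvRaum(String vRaum)     { this.vRaum = vRaum; }
    public void setInfo(String info)       { this.info = info; }

    // Getters for the other fields follow the same pattern.
    public String getKlasse() { return klasse; }
    public String getInfo()   { return info; }

    // Fields that were never set print as "null".
    @Override
    public String toString() {
        return "Entry{klasse=" + klasse + ", stunde=" + stunde + ", lehrer=" + lehrer
                + ", fach=" + fach + ", raum=" + raum + ", vLehrer=" + vLehrer
                + ", vFach=" + vFach + ", vRaum=" + vRaum + ", info=" + info + "}";
    }
}
```

With this in place, new Entry() followed by setter calls and the all-arguments constructor both produce printable objects.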

Next, the code that fills your entries (including the list where they are stored):

List<Entry> entries = new ArrayList<>();        // All entries are saved here
boolean firstSkipped = false;                   // Used to skip first 'tr' tag


for( Element element : doc.select("tr") )       // Select all 'tr' tags from document
{
     // Skip the first 'tr' tag since it's the header
    if( !firstSkipped )
    {
        firstSkipped = true;
        continue;
    }

    int index = 0;                              // Instead of index you can use 0, 1, 2, ...
    Entry tableEntry = new Entry();
    Elements td = element.select("td");         // Select all 'td' tags of the 'tr'

    // Fill your entry
    tableEntry.setKlasse(td.get(index++).text());
    tableEntry.setStunde(td.get(index++).text());
    tableEntry.setLehrer(td.get(index++).text());
    tableEntry.setFach(td.get(index++).text());
    tableEntry.setRaum(td.get(index++).text());
    tableEntry.setvLehrer(td.get(index++).text());
    tableEntry.setvFach(td.get(index++).text());
    tableEntry.setvRaum(td.get(index++).text());
    tableEntry.setInfo(td.get(index++).text());

    entries.add(tableEntry);                    // Finally add it to the list
}

If you use the HTML from your question, you'll get this output:

[Entry{klasse= , stunde=4, lehrer=Méta, fach=HU, raum= , vLehrer=Shne, vFach= , vRaum= , info= }]

Note: I simply used System.out.println(entries); for that, so the output format comes from the toString() method of Entry.

Please see the jsoup documentation, especially the part on the jsoup selector API.
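
If a plain "table" array is all that is needed, as the question asks, a dedicated class is optional. The following is a rough sketch, assuming jsoup is on the classpath; the class name TableRows and the method extractRows are made up for illustration. It collects each non-header row as a list of cell strings, and because &nbsp; may come through as a non-breaking space (U+00A0), it normalises that character before trimming.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of jsoup.
class TableRows {
    // Returns one List<String> per content row; the first 'tr' (the header) is skipped.
    static List<List<String>> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        boolean header = true;

        for (Element tr : doc.select("tr")) {
            if (header) {               // first row is the table header
                header = false;
                continue;
            }
            List<String> cells = new ArrayList<>();
            for (Element td : tr.select("td")) {
                // Turn a possible U+00A0 into a normal space so trim() removes it
                cells.add(td.text().replace('\u00A0', ' ').trim());
            }
            rows.add(cells);
        }
        return rows;
    }
}
```

Calling TableRows.extractRows(html) with the table from the question should give a single row of nine cells, with empty strings where the source has only &nbsp;.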
