Extract data from complex HTML tables to a 2D array in Java


Question

How do I convert HTML tables with colspan and rowspan into a 2D array (matrix) in Java?

I have found nice solutions in Python and jQuery, but not in Java (only for very simple tables via jsoup). There is a pretty solution with XSLT, but it does not work for me because my input HTML files are malformed.

Example input table:

  <body>
    <table border="1">
        <tr><td>H1</td><td colspan="2">H2</td><tr>
        <tr><td></td><td>SubH2_1</td><td>SubH2_2</td><tr>
       <tr><td rowspan="3">A1</td><td>B1</td><td rowspan="2">C1</td></tr>
       <tr><td rowspan="2">B2</td></tr>
       <tr><td>C3</td></tr>
       <tr><td>C4</td><td>C5</td><td>C6</td></tr>
        <tr><td>D7</td><td colspan="2">D9</td></tr>
        <tr><td  colspan="3">Notes</td></tr>
   </table>
</body>

Expected output:

    [['H1', 'H2', 'H2'],
     ['', 'SubH2_1', 'SubH2_2'],
     ['A1', 'B1', 'C1'],
     ['A1', 'B2', 'C3'],
     ['C4', 'C5', 'C6'],
     ['D7', 'D9', 'D9'],
     ['Notes', 'Notes', 'Notes']]


Recommended answer

I found a way to do it using Jsoup and the Java 8 Stream API:

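// Imports required by this snippet (Jsoup must be on the classpath):
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// The statements below form the body of a test method (Jsoup.parse may throw IOException).
//given:
final InputStream html = getClass().getClassLoader().getResourceAsStream("table.html");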

//when:
final Document document = Jsoup.parse(html, "UTF-8", "/");

final List<List<String>> result = document.select("table tr")
    .stream()
    // Select all <td> tags in single row
    .map(tr -> tr.select("td"))
    // Repeat n-times those <td> that have `colspan="n"` attribute
    .map(rows -> rows.stream()
        .map(td -> Collections.nCopies(td.hasAttr("colspan") ? Integer.valueOf(td.attr("colspan")) : 1, td))
        .flatMap(Collection::stream)
        .collect(Collectors.toList())
    )
    // Fold final structure to 2D List<List<Element>>
    .reduce(new ArrayList<List<Element>>(), (acc, row) -> {
        // First iteration - just add current row to a final structure
        if (acc.isEmpty()) {
            acc.add(row);
            return acc;
        }

        // If last array in 2D array does not contain element with `rowspan` - append current
        // row and skip to next iteration step
        final List<Element> last = acc.get(acc.size() - 1);
        if (last.stream().noneMatch(td -> td.hasAttr("rowspan"))) {
            acc.add(row);
            return acc;
        }

        // In this case last array in 2D array contains an element with `rowspan` - we are going to
        // add this element n-times to current rows where n == rowspan - 1
        final AtomicInteger index = new AtomicInteger(0);
        last.stream()
            // Map to a helper list of (index in array, rowspan value or 0 if not present, Jsoup element)
            .map(td -> Arrays.asList(index.getAndIncrement(), Integer.valueOf(td.hasAttr("rowspan") ? td.attr("rowspan") : "0"), td))
            // Filter out all elements without rowspan
            .filter(it -> ((int) it.get(1)) > 1)
            // Add all elements with rowspan to current row at the index they are present 
            // (add them with `rowspan="n-1"`)
            .forEach(it -> {
                final int idx = (int) it.get(0);
                final int rowspan = (int) it.get(1);
                final Element td = (Element) it.get(2);

                row.add(idx, rowspan - 1 == 0 ? (Element) td.removeAttr("rowspan") : td.attr("rowspan", String.valueOf(rowspan - 1)));
            });

        acc.add(row);
        return acc;
    }, (a, b) -> a)
    .stream()
    // Extract inner HTML text from Jsoup elements in 2D array
    .map(tr -> tr.stream()
        .map(Element::text)
        .collect(Collectors.toList())
    )
    .collect(Collectors.toList());

I've added plenty of comments that explain what happens at each step of the algorithm.
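To isolate the colspan step from the rest of the pipeline, here is a minimal sketch that uses only the standard library. The class name `ColspanDemo` and the method `expandRow` are hypothetical, and the cell texts and colspan values are passed in as plain lists instead of Jsoup elements, so this illustrates the `Collections.nCopies` plus `flatMap` idea and is not the answer's exact code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ColspanDemo {

    // texts.get(i) is the text of the i-th cell in a row;
    // colspans.get(i) is its colspan (1 when the attribute is absent).
    static List<String> expandRow(List<String> texts, List<Integer> colspans) {
        return IntStream.range(0, texts.size())
            .boxed()
            // Repeat each cell text colspan-times, then flatten into one row.
            .flatMap(i -> Collections.nCopies(colspans.get(i), texts.get(i)).stream())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // First row of the example table: H1, then H2 with colspan="2".
        System.out.println(expandRow(Arrays.asList("H1", "H2"), Arrays.asList(1, 2)));
        // prints [H1, H2, H2]
    }
}
```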

In this example I used the following HTML file:

<body>
<table border="1">
    <tr><td>H1</td><td colspan="2">H2</td></tr>
    <tr><td></td><td>SubH2_1</td><td>SubH2_2</td></tr>
    <tr><td rowspan="2">A1</td><td>B1</td><td>C1</td></tr>
    <tr><td>B2</td><td>C3</td></tr>
    <tr><td>C4</td><td>C5</td><td>C6</td></tr>
    <tr><td>D7</td><td colspan="2">D9</td></tr>
    <tr><td  colspan="3">Notes</td></tr>
</table>
</body>

It is the same as yours; the only difference is that the rowspan usage is fixed: in your example A1 is repeated three times instead of two. The two <tr> elements are also closed correctly in this example; otherwise two additional empty arrays show up in the final structure.

Here is the console output:

[H1, H2, H2]
[, SubH2_1, SubH2_2]
[A1, B1, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]

You can also run this example with the exact HTML you pasted in your question; it produces slightly different output:

[H1, H2, H2]
[]
[, SubH2_1, SubH2_2]
[]
[A1, B1, C1]
[A1, B2, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]

Those empty arrays show up because there are two unclosed <tr> elements in your HTML:

<tr><td>H1</td><td colspan="2">H2</td><tr>
<tr><td></td><td>SubH2_1</td><td>SubH2_2</td><tr>

Closing them and running the algorithm again produces the following output:

[H1, H2, H2]
[, SubH2_1, SubH2_2]
[A1, B1, C1]
[A1, B2, C1]
[A1, B2, C3]
[C4, C5, C6]
[D7, D9, D9]
[Notes, Notes, Notes]

As you can see, A1 appears 3 times because it has the attribute rowspan="3"; B2 has rowspan="2", and C1 has rowspan="2" as well. The rendered table looks "almost" the same as the one in my first example, but if you take a closer look at those 3 rows you will see that they are not at the same pixel level. To match your expected output, I fixed the input HTML so that it looks and behaves as you expect.
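The way rowspan and colspan claim slots in the output can also be shown with an occupancy grid. This is a different technique from the stream fold used above, offered only as a sketch under stated assumptions: `SpanGrid`, `Cell`, `expand` and `questionTable` are hypothetical names, and the cells are built by hand so the example needs no Jsoup. With Jsoup, the same values would come from `td.text()` and `td.attr("rowspan")` / `td.attr("colspan")`. Rowspans are not clipped at the end of the table here, and short rows are left ragged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpanGrid {

    /** Hypothetical cell model: the text of a td plus its spans (1 when the attribute is absent). */
    static final class Cell {
        final String text;
        final int rowspan;
        final int colspan;

        Cell(String text, int rowspan, int colspan) {
            this.text = text;
            this.rowspan = rowspan;
            this.colspan = colspan;
        }
    }

    static Cell c(String text) { return new Cell(text, 1, 1); }
    static Cell c(String text, int rowspan, int colspan) { return new Cell(text, rowspan, colspan); }

    /** Copies every cell's text into each grid slot the cell covers. */
    static List<List<String>> expand(List<List<Cell>> rows) {
        List<List<String>> grid = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            rowAt(grid, r); // keep an output row for every input row, even an empty one
            int col = 0;
            for (Cell cell : rows.get(r)) {
                // Skip slots already claimed by a rowspan from an earlier row.
                while (taken(grid, r, col)) col++;
                for (int dr = 0; dr < cell.rowspan; dr++)
                    for (int dc = 0; dc < cell.colspan; dc++)
                        put(grid, r + dr, col + dc, cell.text);
                col += cell.colspan;
            }
        }
        // Slots that no cell covered become empty strings.
        for (List<String> row : grid) row.replaceAll(v -> v == null ? "" : v);
        return grid;
    }

    private static List<String> rowAt(List<List<String>> grid, int r) {
        while (grid.size() <= r) grid.add(new ArrayList<>());
        return grid.get(r);
    }

    private static boolean taken(List<List<String>> grid, int r, int col) {
        List<String> row = rowAt(grid, r);
        return col < row.size() && row.get(col) != null;
    }

    private static void put(List<List<String>> grid, int r, int col, String text) {
        List<String> row = rowAt(grid, r);
        while (row.size() <= col) row.add(null);
        row.set(col, text);
    }

    /** The table from the question, with the two unclosed tr elements closed. */
    static List<List<Cell>> questionTable() {
        return Arrays.asList(
            Arrays.asList(c("H1"), c("H2", 1, 2)),
            Arrays.asList(c(""), c("SubH2_1"), c("SubH2_2")),
            Arrays.asList(c("A1", 3, 1), c("B1"), c("C1", 2, 1)),
            Arrays.asList(c("B2", 2, 1)),
            Arrays.asList(c("C3")),
            Arrays.asList(c("C4"), c("C5"), c("C6")),
            Arrays.asList(c("D7"), c("D9", 1, 2)),
            Arrays.asList(c("Notes", 1, 3)));
    }

    public static void main(String[] args) {
        expand(questionTable()).forEach(System.out::println);
    }
}
```

For the question's table this yields the same eight rows as the output listed above, including `[A1, B2, C1]` and `[A1, B2, C3]`, because each spanning cell is written into every slot it covers before the next row is processed.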

Well, if you cannot modify the input HTML, then you will have to:

  • filter out all empty arrays created by the unclosed <tr> tags
  • review your output expectations for A1, B2 and C3: the rendered HTML view does not show the exact structure of this table as written in the HTML.
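The first point can be handled after the fact by dropping empty rows from the result. The sketch below assumes the result is a `List<List<String>>` as in the code above; `EmptyRowFilter` and `dropEmptyRows` are hypothetical names. In the stream pipeline itself, a `.filter(row -> !row.isEmpty())` placed right after `.map(tr -> tr.select("td"))` should have a similar effect, since Jsoup's `Elements` is a list, but that variant is untested here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

final class EmptyRowFilter {

    // Keeps only rows that contain at least one cell. A row holding a single
    // empty string (an empty td) is kept, because it is a real cell.
    static List<List<String>> dropEmptyRows(List<List<String>> rows) {
        return rows.stream()
            .filter(row -> !row.isEmpty())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> raw = Arrays.asList(
            Arrays.asList("H1", "H2", "H2"),
            Collections.<String>emptyList(),          // produced by an unclosed tr
            Arrays.asList("", "SubH2_1", "SubH2_2"),
            Collections.<String>emptyList());         // produced by an unclosed tr
        dropEmptyRows(raw).forEach(System.out::println);
    }
}
```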

Here you can find the full source code of the JUnit test I used to find the answer to your question. Feel free to download this sample Maven project hosted on GitHub to play around with the implementation of the algorithm.

I hope it helps.
