Parsing a table with rowspan and colspan

Views: 147 | Posted: 2018/7/6 16:40:43 | Tags: python, html-parsing, lxml, html-table


Question

I have a table that I need to parse: a school schedule with 4 blocks of time and 5 blocks of days for every week. I've attempted to parse it, but have not gotten very far because I'm stuck on how to deal with the rowspan and colspan attributes. A cell that spans several rows or columns appears only once in the markup, so the data I need for the other positions it covers is missing.

As an example of what I want to do, here's a table:

<tr>
    <td colspan="2" rowspan="4">#1</td>
    <td rowspan="4">#2</td>
    <td rowspan="2">#3</td>
    <td rowspan="2">#4</td>
</tr>

<tr>
</tr>

<tr>
    <td rowspan="2">#5</td>
    <td rowspan="2">#6</td>
</tr>

<tr>
</tr>

I want to take that table and convert it into this list:

[[1,1,2,3,4],
 [1,1,2,3,4],
 [1,1,2,5,6],
 [1,1,2,5,6]]

Right now I'm getting a flat list, similar to this:

[1,2,3,4,5,6]

But in dictionary form, with information about how many columns and rows each cell spans, a description of it, and which week it's in.

Obviously this needs to work for every possible combination of rowspan/colspan, and for multiple weeks in the same table.

The HTML is not as clean as I've portrayed it: there are a lot of attributes I've left out, and the text is not as clean-cut as 1,2,3,4 but rather blocks of descriptive text. But if I could get this part resolved, it should be easy enough to incorporate into what I've already written.

I've been using lxml.html and Python to do this, but I'm open to using other modules if they provide an easier solution.

I hope someone can help me, because I really don't know what to do.

EDIT:

<table>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td rowspan="4">Thing</td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td> </td>
    </tr>
</table>

This is causing me some problems. It outputs:

[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ']

With the code provided by reclosedev, what do I need to change so that it outputs

[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', 'Thing', ' ']

instead?

EDIT2: Using reclosedev's new function, it's approaching a solution, but there are still cases where it fails to place cells correctly:

<table> 
    <tr>
        <td> </td>
        <td rowspan="2"> DMAT Aud. 6 </td>
        <td rowspan="4"> Exam</td>
        <td rowspan="2"> DMAT Aud. 7</td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td rowspan="2"> CART Aud. 4</td>
    </tr>
    <tr>
        <td> </td>
        <td rowspan="2"> CART Aud. 4</td>
        <td rowspan="2"> OOP Aud. 7</td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
    </tr>
</table>

The original table lays the cells out like this:

[
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' '],
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' CART Aud. 4'],
[' ', ' CART Aud. 4' , ' Exam', ' OOP Aud. 7' , ' CART Aud. 4'],
[' ', ' CART Aud. 4' , ' Exam', ' OOP Aud. 7' , ' ']
]

But the new call outputs:

[
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' '],
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' CART Aud. 4'],
[' ', ' CART Aud. 4' , ' Exam', ' CART Aud. 4', ' OOP Aud. 7'],
[' ', ' CART Aud. 4' , ' Exam', ' OOP Aud. 7' , ' ']
]

Recommended answer

UPDATE (previous function removed)

UPDATE2: fixed and simplified.

My first function was wrong. Here's another one; it works, but needs tests:

#!/usr/bin/env python3
from collections import defaultdict


def table_to_list(table):
    """Return the table as a list of rows with rowspan/colspan expanded."""
    dct = table_to_2d_dict(table)
    return list(iter_2d_dict(dct))


def table_to_2d_dict(table):
    # result[row][col] -> cell text (str replaces Python 2's unicode)
    result = defaultdict(lambda: defaultdict(str))
    # './tr' only matches rows that are direct children of <table>
    for row_i, row in enumerate(table.xpath('./tr')):
        for col_i, col in enumerate(row.xpath('./td|./th')):
            colspan = int(col.get('colspan', 1))
            rowspan = int(col.get('rowspan', 1))
            col_data = col.text_content()
            # Skip positions already filled by earlier cells or their spans
            while row_i in result and col_i in result[row_i]:
                col_i += 1
            # Copy the text into every position this cell covers
            for i in range(row_i, row_i + rowspan):
                for j in range(col_i, col_i + colspan):
                    result[i][j] = col_data
    return result


def iter_2d_dict(dct):
    # Yield rows in row order, each with its values in column order
    for i, row in sorted(dct.items()):
        cols = []
        for j, col in sorted(row.items()):
            cols.append(col)
        yield cols


if __name__ == '__main__':
    import lxml.html
    from pprint import pprint

    doc = lxml.html.parse('tables.html')
    for table_el in doc.xpath('//table'):
        table = table_to_list(table_el)
        pprint(table)
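One caveat with iter_2d_dict above: it emits only the positions that were actually filled, so a row with fewer cells than the others comes out shorter, and a gap in the middle of a row shifts the later values left. A minimal sketch of a padded variant follows, assuming an empty string is an acceptable placeholder. The name iter_2d_dict_padded and the fill parameter are hypothetical and not part of the original answer.

```python
def iter_2d_dict_padded(dct, fill=''):
    """Yield rows of equal width, using `fill` where no cell covers a position.

    Hypothetical helper: expects the same {row: {col: text}} mapping that
    table_to_2d_dict builds, but works on any nested dict of that shape.
    """
    if not dct:
        return
    # The grid size comes from the largest row and column index seen.
    n_rows = max(dct) + 1
    n_cols = max((max(row) for row in dct.values() if row), default=-1) + 1
    for i in range(n_rows):
        # .get() does not create entries, even when dct is a defaultdict.
        row = dct.get(i, {})
        yield [row.get(j, fill) for j in range(n_cols)]


# Hand-built sparse grid: row 1 has no value at column 1.
sparse = {0: {0: 'a', 1: 'b', 2: 'c'},
          1: {0: 'd', 2: 'e'}}
padded = list(iter_2d_dict_padded(sparse))
```

With the sparse grid above, the second row keeps 'e' in the third column instead of sliding it into the second.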

tables.html (the input file read by the script above):

<table border="1">
    <tr>
        <td>1 </td>
        <td>1 </td>
        <td>1 </td>
        <td rowspan="4">Thing</td>
        <td>1 </td>
    </tr>
    <tr>
        <td>2 </td>
        <td>2 </td>
        <td>2 </td>
        <td>2 </td>
    </tr>
    <tr>
        <td>3 </td>
        <td>3 </td>
        <td>3 </td>
        <td>3 </td>
    </tr>
    <tr>
        <td>4 </td>
        <td>4 </td>
        <td>4 </td>
        <td>4 </td>
    </tr>
</table>

<table border="1">
<tr>
    <td colspan="2" rowspan="4">#1</td>
    <td rowspan="4">#2</td>
    <td rowspan="2">#3</td>
    <td rowspan="2">#4</td>
</tr>
<tr></tr>
<tr>
    <td rowspan="2">#5</td>
    <td rowspan="2">#6</td>
</tr>
<tr></tr>
</table>

Output:

[['1 ', '1 ', '1 ', 'Thing', '1 '],
 ['2 ', '2 ', '2 ', 'Thing', '2 '],
 ['3 ', '3 ', '3 ', 'Thing', '3 '],
 ['4 ', '4 ', '4 ', 'Thing', '4 ']]
[['#1', '#1', '#2', '#3', '#4'],
 ['#1', '#1', '#2', '#3', '#4'],
 ['#1', '#1', '#2', '#5', '#6'],
 ['#1', '#1', '#2', '#5', '#6']]
