Parsing a table with rowspan and colspan

Views: 33 | Published: 2021/12/28 11:17:27 | Tags: python html-parsing lxml html-table


Question

I have a table that I need to parse. Specifically, it is a school schedule with 4 blocks of time and 5 blocks of days for every week. I've attempted to parse it but have not gotten very far, because I am stuck on how to deal with the rowspan and colspan attributes: the slots a spanning cell covers have no markup of their own, so the data I need to continue is missing.

As an example of what I want to do, here's a table:

<tr>
    <td colspan="2" rowspan="4">#1</td>
    <td rowspan="4">#2</td>
    <td rowspan="2">#3</td>
    <td rowspan="2">#4</td>
</tr>

<tr>
</tr>

<tr>
    <td rowspan="2">#5</td>
    <td rowspan="2">#6</td>
</tr>

<tr>
</tr>

I want to take that table and convert it into this list:

[[1,1,2,3,4],
 [1,1,2,3,4],
 [1,1,2,5,6],
 [1,1,2,5,6]]

Right now I'm getting a flat list, similar to this:

[1,2,3,4,5,6]

Each entry is actually a dictionary, with information on how many columns and rows the cell spans, a description of it, and what week it's in.

Obviously this needs to work for every possibility of rowspan/colspan, and for multiple weeks in the same table.

The HTML is not as clean as I've portrayed it: there are a lot of attributes I've left out, and the text is not as clean-cut as 1, 2, 3, 4 but rather blocks of descriptive text. But if I could get this part resolved, it should be easy enough to incorporate into what I've already written.

I've been using lxml.html and Python to do this, but I'm open to using other modules if they provide an easier solution.

I hope someone can help me, because I really don't know what to do.

EDIT:

<table>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td rowspan="4">Thing</td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
        <td> </td>
        <td> </td>
    </tr>
</table>

This table is causing me some problems. It outputs:

[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ']
[' ', ' ', ' ', ' ', ' ']

What do I need to change in the code provided by reclosedev so that it outputs this instead?

[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', 'Thing', ' ']
[' ', ' ', ' ', 'Thing', ' ']

EDIT2: Using reclosedev's new function, it's approaching a solution, but there are still cases where it fails to place cells correctly:

<table> 
    <tr>
        <td> </td>
        <td rowspan="2"> DMAT Aud. 6 </td>
        <td rowspan="4"> Exam</td>
        <td rowspan="2"> DMAT Aud. 7</td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td rowspan="2"> CART Aud. 4</td>
    </tr>
    <tr>
        <td> </td>
        <td rowspan="2"> CART Aud. 4</td>
        <td rowspan="2"> OOP Aud. 7</td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
    </tr>
</table>

The original table lays these cells out like this:

[
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' '],
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' CART Aud. 4'],
[' ', ' CART Aud. 4' , ' Exam', ' OOP Aud. 7' , ' CART Aud. 4'],
[' ', ' CART Aud. 4' , ' Exam', ' OOP Aud. 7' , ' ']
]

But the new call outputs this:

[
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' '],
[' ', ' DMAT Aud. 6 ', ' Exam', ' DMAT Aud. 7', ' CART Aud. 4'],
[' ', ' CART Aud. 4' , ' Exam', ' CART Aud. 4', ' OOP Aud. 7'],
[' ', ' CART Aud. 4' , ' Exam', ' OOP Aud. 7' , ' ']
]

Solution

UPDATE (removed previous function)

UPDATE2 fixed and simplified.

My first function was wrong. Here's another one; it works but needs tests:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from collections import defaultdict


def table_to_list(table):
    dct = table_to_2d_dict(table)
    return list(iter_2d_dict(dct))


def table_to_2d_dict(table):
    # result[row][col] -> cell text; `str` replaces the Python 2-only `unicode`
    result = defaultdict(lambda: defaultdict(str))
    for row_i, row in enumerate(table.xpath('./tr')):
        for col_i, col in enumerate(row.xpath('./td|./th')):
            colspan = int(col.get('colspan', 1))
            rowspan = int(col.get('rowspan', 1))
            col_data = col.text_content()
            # skip slots already filled by spans from earlier cells or rows
            while row_i in result and col_i in result[row_i]:
                col_i += 1
            # copy the cell text into every slot it spans
            for i in range(row_i, row_i + rowspan):
                for j in range(col_i, col_i + colspan):
                    result[i][j] = col_data
    return result


def iter_2d_dict(dct):
    for i, row in sorted(dct.items()):
        cols = []
        for j, col in sorted(row.items()):
            cols.append(col)
        yield cols


if __name__ == '__main__':
    import lxml.html
    from pprint import pprint

    doc = lxml.html.parse('tables.html')
    for table_el in doc.xpath('//table'):
        table = table_to_list(table_el)
        pprint(table)

tables.html:

<table border="1">
    <tr>
        <td>1 </td>
        <td>1 </td>
        <td>1 </td>
        <td rowspan="4">Thing</td>
        <td>1 </td>
    </tr>
    <tr>
        <td>2 </td>
        <td>2 </td>
        <td>2 </td>
        <td>2 </td>
    </tr>
    <tr>
        <td>3 </td>
        <td>3 </td>
        <td>3 </td>
        <td>3 </td>
    </tr>
    <tr>
        <td>4 </td>
        <td>4 </td>
        <td>4 </td>
        <td>4 </td>
    </tr>
</table>

<table border="1">
<tr>
    <td colspan="2" rowspan="4">#1</td>
    <td rowspan="4">#2</td>
    <td rowspan="2">#3</td>
    <td rowspan="2">#4</td>
</tr>
<tr></tr>
<tr>
    <td rowspan="2">#5</td>
    <td rowspan="2">#6</td>
</tr>
<tr></tr>
</table>

Output:

[['1 ', '1 ', '1 ', 'Thing', '1 '],
 ['2 ', '2 ', '2 ', 'Thing', '2 '],
 ['3 ', '3 ', '3 ', 'Thing', '3 '],
 ['4 ', '4 ', '4 ', 'Thing', '4 ']]
[['#1', '#1', '#2', '#3', '#4'],
 ['#1', '#1', '#2', '#3', '#4'],
 ['#1', '#1', '#2', '#5', '#6'],
 ['#1', '#1', '#2', '#5', '#6']]
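
The sample output above does not cover the EDIT2 table from the question, so here is a minimal sketch that runs the same "skip occupied slots, then fill the span" rule on it. This is not the code from the answer: it swaps lxml for the standard-library `html.parser` so it runs without extra dependencies, and the names `CellCollector` and `expand_spans` are made up for illustration. It assumes there are no nested tables, that span attributes are plain integers, and it strips whitespace from cell text (the question's output keeps it).

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect each <tr> as a list of (text, rowspan, colspan) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th') and self.rows:
            a = dict(attrs)
            # a missing or empty span attribute counts as 1
            self._cell = [[], int(a.get('rowspan') or 1), int(a.get('colspan') or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append((''.join(parts).strip(), rowspan, colspan))
            self._cell = None


def expand_spans(rows):
    """Turn rows of (text, rowspan, colspan) into a rectangular list of lists."""
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # skip slots already claimed by a rowspan from an earlier row
            while (r, c) in grid:
                c += 1
            for i in range(r, r + rowspan):
                for j in range(c, c + colspan):
                    grid[(i, j)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(i for i, _ in grid) + 1
    n_cols = max(j for _, j in grid) + 1
    return [[grid.get((i, j), '') for j in range(n_cols)] for i in range(n_rows)]


html = """
<table>
    <tr>
        <td> </td>
        <td rowspan="2"> DMAT Aud. 6 </td>
        <td rowspan="4"> Exam</td>
        <td rowspan="2"> DMAT Aud. 7</td>
        <td> </td>
    </tr>
    <tr>
        <td> </td>
        <td rowspan="2"> CART Aud. 4</td>
    </tr>
    <tr>
        <td> </td>
        <td rowspan="2"> CART Aud. 4</td>
        <td rowspan="2"> OOP Aud. 7</td>
    </tr>
    <tr>
        <td> </td>
        <td> </td>
    </tr>
</table>
"""

collector = CellCollector()
collector.feed(html)
collector.close()
grid = expand_spans(collector.rows)
for row in grid:
    print(row)
```

Keeping a running column counter `c` per row, instead of restarting from the cell's index, gives the same placement here while avoiding rescans of slots the row has already filled.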
