Detecting the header in HTML tables using BeautifulSoup / lxml when the table lacks a thead element

Question

I'd like to detect the header of an HTML table when that table does not have <thead> elements. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this with Python in both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to get out of it a thead object, a tbody object, and a tfoot object.

Currently, parse_thead does the following when the <thead> tag is present:

  • In BeautifulSoup, I get table objects with doc.find_all('table') and I can use table.find_all('thead')
  • In lxml, I get table objects with doc.xpath() on an xpath_expr on //table, and I can use table.xpath('.//thead')

parse_tbody and parse_tfoot work in the same way. (I did not write this code and I am not experienced with either BS or lxml.) However, without a <thead>, parse_thead returns nothing and parse_tbody returns the header and the body together.

I append a wikitable instance below as an example. It lacks <thead> and <tbody>. Instead all rows, header or not, are enclosed in <tr>...</tr>, but header rows have <th> elements and body rows have <td> elements. Without <thead>, it seems like the right criterion for identifying the header is "from the start, put rows into the header until you find a row that has an element that's not <th>".

I'd appreciate suggestions on how I could write parse_thead and parse_tbody. Without much experience here, I would think I could either

  • Dive into the table object and manually insert thead and tbody tags before parsing it (this seems nice because then I wouldn't have to change any other code that recognizes tables with <thead>), or alternately
  • Change parse_thead and parse_tbody to recognize the table rows that have only <th> elements. (With either alternative, it seems like I really need to detect the head-body boundary in this way.)

I don't know how to do either of those things and I'd appreciate advice on both which alternative is more sensible and how I might go about it.

(There are examples with no header rows and with multiple header rows, so I can't assume a table has exactly one header row.)

<table class="wikitable">
<tr>
<th>Rank</th>
<th>Score</th>
<th>Overs</th>
<th><b>Ext</b></th>
<th>b</th>
<th>lb</th>
<th>w</th>
<th>nb</th>
<th>Opposition</th>
<th>Ground</th>
<th>Match Date</th>
</tr>
<tr>
<td>1</td>
<td>437</td>
<td>136.0</td>
<td><b>64</b></td>
<td>18</td>
<td>11</td>
<td>1</td>
<td>34</td>
<td>v West Indies</td>
<td>Manchester</td>
<td>27 Jul 1995</td>
</tr>
</table>

Recommended answer

When a table doesn't contain <thead> tags, we can use the <th> tags to detect header rows: if every cell in a row is a <th>, we can assume that the row is a header row. Based on that, I wrote a function that separates the header rows from the body rows.

Code for BeautifulSoup:

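def parse_table(table):
    """Split a table's rows into header rows (all <th>) and body rows."""
    head_body = {'head': [], 'body': []}
    for tr in table.select('tr'):
        # A row counts as a header row when every direct child tag is a <th>.
        if all(t.name == 'th' for t in tr.find_all(recursive=False)):
            head_body['head'].append(tr)
        else:
            head_body['body'].append(tr)
    return head_body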

Code for lxml:

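def parse_table(table):
    """Split a table's rows into header rows (all <th>) and body rows."""
    head_body = {'head': [], 'body': []}
    # cssselect() needs the separate cssselect package to be installed.
    for tr in table.cssselect('tr'):
        # Iterating over an element yields its children (getchildren() is deprecated).
        if all(t.tag == 'th' for t in tr):
            head_body['head'].append(tr)
        else:
            head_body['body'].append(tr)
    return head_body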

The table parameter is either a Beautiful Soup Tag object or an lxml Element object. head_body is a dictionary that contains two lists of <tr> tags: the header rows and the body rows.
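
Note that the function above puts every all-<th> row into head wherever it appears in the table, and an empty <tr> also lands in head. The criterion from the question (take rows from the start until the first row that is not all <th>) could be sketched roughly as follows for BeautifulSoup. The names is_header_row and split_leading_header and the sample table are made up for this illustration; treat it as a sketch under those assumptions, not as a definitive implementation.

```python
from bs4 import BeautifulSoup


def is_header_row(tr):
    """True if the row has at least one cell and every cell is a <th>."""
    cells = tr.find_all(['th', 'td'], recursive=False)
    return bool(cells) and all(cell.name == 'th' for cell in cells)


def split_leading_header(table):
    """Split rows into the leading run of header rows and everything after it."""
    # Keep only rows that belong to this table, not to a nested table.
    rows = [tr for tr in table.find_all('tr')
            if tr.find_parent('table') is table]
    head = []
    for tr in rows:
        if not is_header_row(tr):
            break  # the first non-header row ends the header
        head.append(tr)
    return {'head': head, 'body': rows[len(head):]}


# Invented sample: two header rows, one body row, one all-<th> row at the end.
html = """
<table>
<tr><th>Rank</th><th>Score</th></tr>
<tr><th>#</th><th>runs</th></tr>
<tr><td>1</td><td>437</td></tr>
<tr><th>Total</th><th>437</th></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')
rows = split_leading_header(table)
print(len(rows['head']), len(rows['body']))
# 2 2
```

Here the trailing "Total" row stays in body because it comes after the first non-header row. A table with no header rows simply yields an empty head list.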

Usage example:

from bs4 import BeautifulSoup

html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
table_rows = parse_table(table)

print(table_rows)
# {'head': [<tr><th>header</th></tr>], 'body': [<tr><td>body</td></tr>]}
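
The question also asks about the other alternative: inserting thead and tbody tags into the table so that existing code such as table.find_all('thead') keeps working. A minimal BeautifulSoup sketch of that idea is below. The function name add_thead_tbody is hypothetical, and the sketch assumes the table has no <thead>/<tbody> yet and contains no nested tables.

```python
from bs4 import BeautifulSoup


def add_thead_tbody(soup, table):
    """Wrap leading all-<th> rows in <thead> and the remaining rows in <tbody>.

    Assumes the table has no <thead>/<tbody> yet and no nested tables.
    """
    thead = soup.new_tag('thead')
    tbody = soup.new_tag('tbody')
    in_header = True
    for tr in table.find_all('tr'):
        cells = tr.find_all(['th', 'td'], recursive=False)
        if not (cells and all(cell.name == 'th' for cell in cells)):
            in_header = False  # the first non-header row ends the header
        # extract() detaches the row; append() places it in its new parent.
        (thead if in_header else tbody).append(tr.extract())
    if thead.contents:
        table.append(thead)
    table.append(tbody)
    return table


html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = add_thead_tbody(soup, soup.find('table'))
print([child.name for child in table.find_all(recursive=False)])
# ['thead', 'tbody']
```

After this step, table.find_all('thead') returns the new header element, so parsing code that expects a <thead> should not need to change.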

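For lxml, the rule "take rows from the start until the first row that is not all <th>" might look roughly like this. It uses xpath() rather than cssselect(), so it does not need the cssselect package. The helper names is_header_row and split_leading_header are invented for this sketch, and only rows that sit directly under the table or its thead/tbody/tfoot are considered.

```python
from itertools import takewhile

import lxml.html


def is_header_row(tr):
    """True if the row has at least one cell and every cell is a <th>."""
    cells = [el for el in tr if el.tag in ('th', 'td')]
    return bool(cells) and all(el.tag == 'th' for el in cells)


def split_leading_header(table):
    """Split rows into the leading run of header rows and everything after it."""
    # Select only this table's own rows, in document order.
    rows = table.xpath('./tr | ./thead/tr | ./tbody/tr | ./tfoot/tr')
    head = list(takewhile(is_header_row, rows))
    return {'head': head, 'body': rows[len(head):]}


html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
doc = lxml.html.fromstring(html)
table = doc.xpath('//table')[0]
rows = split_leading_header(table)
print(len(rows['head']), len(rows['body']))
# 1 1
```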