HTML Table to List Parsing - <TBODY> monkey wrench for both xml and lxml

Question

I read the answers to "Parse HTML table to Python list?" and tried to apply those ideas to local HTML files that I downloaded from a web site. Each file contains a single table and starts with the <table class="table"> tag. I ran into problems caused by the presence of two HTML tags.

With the <thead> tag present, the parse does not pick up the header row, and the <tbody> tag causes both xml and lxml to fail completely.

I tried googling for a solution, but the answer is most likely buried somewhere in the documentation for xml and/or lxml.

I'm just trying to plug into xml or lxml in the simplest way possible, but I would be happy if the community here pointed the way to other 'stable/trusted' modules that might be more appropriate.

I realize I could edit the strings in Python to remove the tags, but that is not very elegant, and I'm trying to learn new things.

Here is the stripped-down sample code illustrating the problem:

#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: Parse HTML table to list
#--------*---------*---------*---------*---------*---------*---------*---------*
import os, sys
from xml.etree import ElementTree as ET
from lxml import etree


#                  # this setting blows up

s     = """<table class="table">
<thead>
<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>
</thead>
<tbody>
<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>
</tbody>
</table>
"""

#                  # open this up for clear sailing
if False:
    s     = """<table class="table">

<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>


<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>

</table>
"""

s = s.replace('\n','')
print('0:\n'+s)

while True:
    table = ET.XML(s)
    rows = iter(table)
    for row in rows:
        values = [col.text for col in row]
        print('1:')
        print(values)
    break

while True:
    table = etree.HTML(s).find("body/table")
    rows = iter(table)
    for row in rows:
        values = [col.text for col in row]
        print('2:')
        print(values)
    break

sys.exit()

Recommended answer

While waiting for some help showing how to do this in a 'Pythonic way', I came up with an easy brute-force method:

With the string s set to the version that contains the <thead> and <tbody> tags, apply the following code:

# Strip the section wrappers so every <tr> becomes a direct child of <table>
for tag in ('<tbody>', '</tbody>', '<thead>', '</thead>'):
    s = s.replace(tag, '')
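
If the goal is to keep the <thead> and <tbody> tags rather than strip them from the string, one possible alternative is to iterate over every <tr> descendant of the table instead of its direct children. The sketch below is not part of the original answer. It assumes the markup is well-formed enough for xml.etree.ElementTree to parse, and the helper name table_to_rows is made up for illustration.

```python
from xml.etree import ElementTree as ET

# Same shape as the table in the question, with <thead> and <tbody> left in place.
html = """<table class="table">
<thead>
<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>
</thead>
<tbody>
<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>
</tbody>
</table>
"""


def table_to_rows(markup):
    """Return one list of cell strings per <tr>, at any nesting depth.

    Hypothetical helper: Element.iter("tr") walks all descendants, so it
    does not matter whether the rows sit inside <thead>/<tbody> or not.
    """
    table = ET.fromstring(markup)
    rows = []
    for tr in table.iter("tr"):
        # cell.text is None for an empty cell, so fall back to "".
        rows.append([(cell.text or "").strip() for cell in tr])
    return rows


rows = table_to_rows(html)
for row in rows:
    print(row)
```

lxml elements expose an iter() method as well, so the same loop should carry over to the element returned by etree.HTML(s) when the input is HTML that is not valid XML.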
