HTML Table to List Parsing - <TBODY> monkey wrench for both xml and lxml

Views: 86 Published: 2020/5/4 8:38:50 Tags: python xml python-3.x lxml

Problem description

I read the answers to "Parse HTML table to Python list?" and tried to use the ideas to read and process local HTML files that I downloaded from a web site (each file contains one table and starts with the <table class="table"> tag). I ran into problems due to the presence of two HTML tags.

With the <thead> tag the parse doesn't pick up the header, and the <tbody> tag causes both xml and lxml to fail completely.

I tried googling for a solution, but the answer is most likely buried somewhere in the documentation for xml and/or lxml.

I'm just trying to plug into xml or lxml in the simplest way possible, but I would be happy if the community here pointed the way to other 'stable/trusted' modules that might be more appropriate.

I realize I could edit the strings in Python to remove the tags, but that is not very elegant, and I'm trying to learn new things.

Here is the stripped-down sample code illustrating the problem:

#--------*---------*---------*---------*---------*---------*---------*---------*
# Desc: Parse HTML table to list
#--------*---------*---------*---------*---------*---------*---------*---------*
import os, sys
from xml.etree import ElementTree as ET
from lxml import etree


#                  # this setting blows up

s     = """<table class="table">
<thead>
<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>
</thead>
<tbody>
<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>
</tbody>
</table>
"""

#                  # open this up for clear sailing
if False:
    s     = """<table class="table">

<tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr>


<tr>
<td>UTG</td><td></td><td>
</td><td>2.7%, KK+ AQs+ A5s AKo </td>
</tr>
<tr>
<td></td><td>BB</td><td>
</td><td>10.6%, 55+ A9s+ A9o+ </td>
</tr>

</table>
"""

s = s.replace('\n','')
print('0:\n'+s)

while True:
    table = ET.XML(s)
    rows = iter(table)
    for row in rows:
        values = [col.text for col in row]
        print('1:')
        print(values)
    break

while True:
    table = etree.HTML(s).find("body/table")
    rows = iter(table)
    for row in rows:
        values = [col.text for col in row]
        print('2:')
        print(values)
    break

sys.exit()
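One possible way around the wrappers, offered as a minimal sketch and not as the accepted answer: the loops above iterate over the direct children of the table, which are <thead> and <tbody> when those tags are present, not the <tr> rows. Iterating over every <tr> descendant with `Element.iter("tr")` avoids depending on the nesting depth. The names `table_to_rows` and `html` are hypothetical, and the sketch assumes the markup is well-formed XML, as the sample is. lxml elements offer the same `iter` method, so the idea should carry over to a table found through `etree.HTML`.

```python
from xml.etree import ElementTree as ET

# Same markup as the sample above, shortened to one body row (illustrative data).
html = (
    '<table class="table">'
    "<thead><tr><th>PU</th><th>CA</th><th>OC</th><th>Range</th></tr></thead>"
    "<tbody><tr><td>UTG</td><td></td><td></td>"
    "<td>2.7%, KK+ AQs+ A5s AKo </td></tr></tbody>"
    "</table>"
)


def table_to_rows(markup):
    """Return one list of cell strings per <tr>, at any nesting depth."""
    table = ET.XML(markup)  # assumes the markup is well-formed XML
    rows = []
    # iter("tr") walks all descendants, so <thead>/<tbody> wrappers do not matter.
    for tr in table.iter("tr"):
        # itertext() collects text from nested tags; empty cells become "".
        rows.append(["".join(cell.itertext()).strip() for cell in tr])
    return rows


rows = table_to_rows(html)
for row in rows:
    print(row)
```

The header row comes back as the first list, so it can be split off with `rows[0]` and `rows[1:]` if headers and data need separate handling.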
