How to extract an HTML table using BeautifulSoup

Views: 142 Published: 2016/8/5 19:15:55 Tags: python html html-parsing beautifulsoup parent

Question

Take the HTML snippet below as an example:

>>> soup
<table>
<tr><td class="abc">This is ABC</td>
</tr>
<tr><td class="firstdata"> data1_xxx </td>
</tr>
</table>

<table>
<tr><td class="efg">This is EFG</td>
</tr>
<tr><td class="firstdata"> data1_xxx </td>
</tr>
</table>

If I can only identify the table I want by the class of one of its td cells,

>>> soup.find_all("td", {"class": "abc"})
[<td class="abc">This is ABC</td>]

how can I extract the whole table that contains that cell, as shown below?

<table>
<tr><td class="abc">This is ABC</td>
</tr>
<tr><td class="firstdata"> data1_xxx </td>
</tr>
</table>

Solution

Find the td tag, then get its enclosing table with find_parent():

soup.find("td", {"class": "abc"}).find_parent("table")
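One caveat with the one-liner above: find() returns None when no td matches, so the chained find_parent() call would then raise AttributeError. A minimal sketch of a guarded version follows, assuming Python 3 and the built-in "html.parser" backend; the helper name find_table_by_cell_class and the HTML constant are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Same two tables as in the question.
HTML = """
<table>
<tr><td class="abc">This is ABC</td></tr>
<tr><td class="firstdata"> data1_xxx </td></tr>
</table>
<table>
<tr><td class="efg">This is EFG</td></tr>
<tr><td class="firstdata"> data1_xxx </td></tr>
</table>
"""


def find_table_by_cell_class(soup, cell_class):
    """Hypothetical helper: return the <table> enclosing the first <td>
    with the given class, or None if no such cell exists."""
    cell = soup.find("td", class_=cell_class)
    if cell is None:
        # No matching cell, so there is no table to return.
        return None
    return cell.find_parent("table")


soup = BeautifulSoup(HTML, "html.parser")
table = find_table_by_cell_class(soup, "abc")
missing = find_table_by_cell_class(soup, "no-such-class")

print(table)    # the first table
print(missing)  # None
```

Returning None rather than raising keeps the helper consistent with how find() itself signals "no match"; callers that prefer an exception can raise one at the check instead.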

Demo:

>>> from bs4 import BeautifulSoup
>>> data = """
... <div>
...     <table>
...         <tr><td class="abc">This is ABC</td>
...         </tr>
...         <tr><td class="firstdata"> data1_xxx </td>
...         </tr>
...     </table>
... 
...     <table>
...         <tr><td class="efg">This is EFG</td>
...         </tr>
...         <tr><td class="firstdata"> data1_xxx </td>
...         </tr>
...     </table>
... </div>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> print(soup.find("td", {"class": "abc"}).find_parent("table"))
<table>
<tr><td class="abc">This is ABC</td>
</tr>
<tr><td class="firstdata"> data1_xxx </td>
</tr>
</table>
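The demo relies on the class "abc" appearing in only one table. A class such as "firstdata" appears in both tables, and find() would only ever reach the first one. The sketch below, again assuming Python 3 and "html.parser", collects every table that contains a matching cell; the helper name find_tables_by_cell_class is hypothetical.

```python
from bs4 import BeautifulSoup

# Same two tables as in the question.
HTML = """
<table>
<tr><td class="abc">This is ABC</td></tr>
<tr><td class="firstdata"> data1_xxx </td></tr>
</table>
<table>
<tr><td class="efg">This is EFG</td></tr>
<tr><td class="firstdata"> data1_xxx </td></tr>
</table>
"""


def find_tables_by_cell_class(soup, cell_class):
    """Hypothetical helper: return every <table> that encloses at least one
    <td> with the given class, in document order and without repeats."""
    tables = []
    seen_ids = set()
    for cell in soup.find_all("td", class_=cell_class):
        table = cell.find_parent("table")
        # Track object identity so a table with several matching cells
        # is only reported once.
        if table is not None and id(table) not in seen_ids:
            seen_ids.add(id(table))
            tables.append(table)
    return tables


soup = BeautifulSoup(HTML, "html.parser")
shared = find_tables_by_cell_class(soup, "firstdata")
unique = find_tables_by_cell_class(soup, "abc")

print(len(shared))  # 2
print(len(unique))  # 1
```

Identity is tracked with id() on purpose: two distinct tables with identical markup should still count as two results.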
