Extracting table contents from HTML with Python and BeautifulSoup

Views: 148  Published: 2016/8/5 19:07:21  Tags: python beautifulsoup screen-scraping

Problem description

I want to extract certain information from an HTML document. For example, it contains a table like this (among other tables with other contents):

    <table class="details">
            <tr>
                    <th>Advisory:</th>
                    <td>RHBA-2013:0947-1</td>
            </tr>
            <tr>    
                    <th>Type:</th>
                    <td>Bug Fix Advisory</td>
            </tr>
            <tr>
                    <th>Severity:</th>
                    <td>N/A</td>
            </tr>
            <tr>    
                    <th>Issued on:</th>
                    <td>2013-06-13</td>
            </tr>
            <tr>    
                    <th>Last updated on:</th>
                    <td>2013-06-13</td>
            </tr>

            <tr>
                    <th valign="top">Affected Products:</th>
                    <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
            </tr>


    </table>

I want to extract information such as the date of "Issued on:". It looks like BeautifulSoup4 could do this easily, but somehow I can't get it right. My code so far:

    from bs4 import BeautifulSoup
    soup=BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
    table_tag=soup.table
    if table_tag['class'] == ['details']:
            print table_tag.tr.th.get_text() + " " + table_tag.tr.td.get_text()
            a=table_tag.next_sibling
            print  unicode(a)
            print table_tag.contents

This gets me the contents of the first table row, and also a listing of the contents. But the next-sibling lookup is not working right; I guess I am just using it wrong. Of course I could parse the contents list myself, but it seems to me that Beautiful Soup was designed to save us from doing exactly that (if I start parsing myself, I might as well parse the whole document). If someone could enlighten me on how to accomplish this, I would be grateful. If there is a better way than BeautifulSoup, I would be interested to hear about it.

Solution

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(unicodestring_containing_the_entire_htlm_doc)
    >>> table = soup.find('table', {'class': 'details'})
    >>> th = table.find('th', text='Issued on:')
    >>> th
    <th>Issued on:</th>
    >>> td = th.findNext('td')
    >>> td
    <td>2013-06-13</td>
    >>> td.text
    u'2013-06-13'
