Extracting table contents from HTML with Python and BeautifulSoup

Views: 21 | Published: 2021/12/23 20:00:26 | Tags: python beautifulsoup screen-scraping


Problem description

I want to extract certain information from an HTML document. For example, it contains a table (among other tables with other contents) like this:

    <table class="details">
            <tr>
                    <th>Advisory:</th>
                    <td>RHBA-2013:0947-1</td>
            </tr>
            <tr>    
                    <th>Type:</th>
                    <td>Bug Fix Advisory</td>
            </tr>
            <tr>
                    <th>Severity:</th>
                    <td>N/A</td>
            </tr>
            <tr>    
                    <th>Issued on:</th>
                    <td>2013-06-13</td>
            </tr>
            <tr>    
                    <th>Last updated on:</th>
                    <td>2013-06-13</td>
            </tr>

            <tr>
                    <th valign="top">Affected Products:</th>
                    <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
            </tr>


    </table>

I want to extract information such as the date listed under "Issued on:". It looks like BeautifulSoup4 could do this easily, but somehow I can't get it right. My code so far:

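    from bs4 import BeautifulSoup

    # Placeholder name: a str holding the entire HTML document
    soup = BeautifulSoup(unicodestring_containing_the_entire_html_doc, "html.parser")
    table_tag = soup.table  # first <table> in the document
    if table_tag['class'] == ['details']:
        # Prints the header and value of the first row only
        print(table_tag.tr.th.get_text() + " " + table_tag.tr.td.get_text())
        a = table_tag.next_sibling
        print(str(a))
        print(table_tag.contents)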

This gets me the contents of the first table row, plus a listing of the table's contents. But the next-sibling part is not working right; I guess I am just using it wrong. Of course I could parse the contents list myself, but it seems to me that Beautiful Soup was designed to spare us exactly that (if I start parsing by hand, I might as well parse the whole document). If someone could enlighten me on how to accomplish this, I would be grateful. If there is a better way than BeautifulSoup, I would be interested to hear about it.
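For reference, here is a minimal sketch of one way this could be approached. It is not taken from an accepted answer. It assumes the built-in "html.parser" backend and a shortened copy of the table above; the names HTML and extract_details, and the extra "other" table, are made up for illustration. The idea is to select the table by its class instead of taking the first table in the document, then pair each th with the td in the same row. The last lines show why plain next_sibling tends to surprise: the node right after a tag is usually a whitespace string, while find_next_sibling("tr") skips ahead to the next tag.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question, preceded by an unrelated
# table (hypothetical) to show why filtering on the class matters.
HTML = """
<table class="other"><tr><th>Foo:</th><td>bar</td></tr></table>
<table class="details">
  <tr>
    <th>Advisory:</th>
    <td>RHBA-2013:0947-1</td>
  </tr>
  <tr>
    <th>Issued on:</th>
    <td>2013-06-13</td>
  </tr>
  <tr>
    <th valign="top">Affected Products:</th>
    <td><a href="#Red Hat Enterprise Linux ELS (v. 4)">Red Hat Enterprise Linux ELS (v. 4)</a></td>
  </tr>
</table>
"""


def extract_details(html):
    """Return {header text: cell text} for the first table with class 'details'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="details")
    if table is None:
        return {}
    details = {}
    for row in table.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        if header is None or cell is None:
            continue  # skip rows that are not header/value pairs
        details[header.get_text(strip=True)] = cell.get_text(strip=True)
    return details


details = extract_details(HTML)
print(details["Issued on:"])  # 2013-06-13

# Sibling navigation: .next_sibling returns the whitespace between rows,
# while find_next_sibling("tr") returns the next <tr> tag.
first_row = BeautifulSoup(HTML, "html.parser").find("table", class_="details").tr
whitespace_node = first_row.next_sibling
second_row = first_row.find_next_sibling("tr")
print(repr(whitespace_node))
print(second_row.th.get_text(strip=True))
```

Building a dictionary keeps the lookup independent of row order, so a field such as "Last updated on:" can be read the same way without walking siblings at all.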
