Confusion about reading HTML table contents using BeautifulSoup?

Views: 1319 Published: 2016/8/5 19:19:51 Tags: python python-2.7 beautifulsoup


Question

Below is the HTML content:

<table cellspacing="1" cellpadding="0" class="data">
<tr class="colhead">
            <th colspan="3">Expression</th>
        </tr>
        <tr class="colhead">
            <th>Task</th>
            <th>Action</th>
            <th>List</th>
</tr>           
<tr class="rowLight">
    <td width="40%">
            Task1
        </td>
        <td width="20%">
             Assigned to 
        </td>
        <td width="40%">
             Harry
    </td>

</tr>           
<tr class="rowDark">
     <td width="40%">
                    Task2
                </td>
                <td width="20%">
                     Rejected by 
                </td>
                <td width="40%">
                    Lopa 
                </td>
</tr>

<tr class="rowLight">
    <td width="40%">
            Task5
        </td>
        <td width="20%">
             Accepted By 
        </td>
        <td width="40%">
            Mathew
        </td>
</tr>
</table>

Now I need to extract the values shown below. (The table below is just the Excel table that I will build once I have the values.)

Task    Action        List
Task1   Assigned to   Harry
Task2   Rejected by   Lopa
Task5   Accepted By   Mathew

The only solution I know as a layman is the following:

from bs4 import BeautifulSoup
soup = BeautifulSoup(source_URL)

alltables = soup.findAll("table", {"border": "2", "width": "100%"})

t = [x for x in soup.findAll('td')]

[x.renderContents().strip('\n') for x in t]

But my HTML content above has no such structure, so how should I approach this? Please guide me here!

Recommended answer

Use .stripped_strings to get the 'interesting' text from a table row:

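# Locate the table by its class (assumes `soup` holds the parsed HTML above)
table = soup.find('table', class_='data')
rows = table.find_all('tr', class_=('rowLight', 'rowDark'))
for row in rows:
    print list(row.stripped_strings)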

This outputs:

[u'Task1', u'Assigned to', u'Harry']
[u'Task2', u'Rejected by', u'Lopa']
[u'Task5', u'Accepted By', u'Mathew']
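The snippet above is written for Python 2 and assumes `soup` already exists. As a minimal self-contained sketch of the same idea for Python 3, the version below parses a trimmed copy of the question's table from a string. The names `HTML` and `extract_rows` are illustrative only, and the built-in "html.parser" backend is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the table from the question.
HTML = """
<table cellspacing="1" cellpadding="0" class="data">
<tr class="colhead"><th colspan="3">Expression</th></tr>
<tr class="colhead"><th>Task</th><th>Action</th><th>List</th></tr>
<tr class="rowLight">
    <td width="40%"> Task1 </td>
    <td width="20%"> Assigned to </td>
    <td width="40%"> Harry </td>
</tr>
<tr class="rowDark">
    <td width="40%"> Task2 </td>
    <td width="20%"> Rejected by </td>
    <td width="40%"> Lopa </td>
</tr>
<tr class="rowLight">
    <td width="40%"> Task5 </td>
    <td width="20%"> Accepted By </td>
    <td width="40%"> Mathew </td>
</tr>
</table>
"""


def extract_rows(html):
    """Return the text of each data row as a list of stripped strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="data")
    # A list passed to class_ matches rows carrying either class,
    # so the "colhead" header rows are skipped.
    rows = table.find_all("tr", class_=["rowLight", "rowDark"])
    return [list(row.stripped_strings) for row in rows]


result = extract_rows(HTML)
for row in result:
    print(row)
```

In Python 3 the strings print without the u prefix, but the row contents are the same as in the output above.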

Or, to pull everything into one list of lists (with, by request, the last row not included):

data = [list(r.stripped_strings) for r in rows[:-1]]

This becomes:

data = [[u'Task1', u'Assigned to', u'Harry'], [u'Task2', u'Rejected by', u'Lopa']]
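Since the question mentions building an Excel table from these values, a list of lists like this maps naturally onto spreadsheet rows. The sketch below is one possible way to do that, not part of the original answer: it assumes CSV is an acceptable target (Excel can open it), takes the column names from the last "colhead" row, and uses the hypothetical names `HTML` and `table_to_csv`.

```python
import csv
import io

from bs4 import BeautifulSoup

# A compact copy of the table from the question.
HTML = """
<table class="data">
<tr class="colhead"><th colspan="3">Expression</th></tr>
<tr class="colhead"><th>Task</th><th>Action</th><th>List</th></tr>
<tr class="rowLight"><td> Task1 </td><td> Assigned to </td><td> Harry </td></tr>
<tr class="rowDark"><td> Task2 </td><td> Rejected by </td><td> Lopa </td></tr>
<tr class="rowLight"><td> Task5 </td><td> Accepted By </td><td> Mathew </td></tr>
</table>
"""


def table_to_csv(html):
    """Return the table's header and data rows as CSV text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="data")
    # The last "colhead" row holds the column names (Task, Action, List).
    header_row = table.find_all("tr", class_="colhead")[-1]
    header = list(header_row.stripped_strings)
    rows = table.find_all("tr", class_=["rowLight", "rowDark"])
    data = [list(r.stripped_strings) for r in rows]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(data)
    return buffer.getvalue()


csv_text = table_to_csv(HTML)
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of file paths; passing an open file object to `csv.writer` instead would write the same rows to disk.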

The result of .find_all(), a ResultSet, acts just like a Python list, so you can slice it at will to ignore certain rows, for example.
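To illustrate the list-like behaviour, here is a short sketch that applies a few ordinary slices to a ResultSet. The helper `first_cells` and the reduced `HTML` string are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# A reduced copy of the data rows from the question.
HTML = """<table class="data">
<tr class="rowLight"><td>Task1</td><td>Assigned to</td><td>Harry</td></tr>
<tr class="rowDark"><td>Task2</td><td>Rejected by</td><td>Lopa</td></tr>
<tr class="rowLight"><td>Task5</td><td>Accepted By</td><td>Mathew</td></tr>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")
rows = soup.find_all("tr", class_=["rowLight", "rowDark"])


def first_cells(row_subset):
    """Return the first cell's text for each row in the subset."""
    return [next(r.stripped_strings) for r in row_subset]


all_tasks = first_cells(rows)          # every row
without_last = first_cells(rows[:-1])  # drop the last row
without_first = first_cells(rows[1:])  # drop the first row
every_other = first_cells(rows[::2])   # rows at even positions

print(len(rows), all_tasks, without_last, without_first, every_other)
```

Indexing (`rows[0]`), `len(rows)`, and iteration work the same way as on a plain list.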
