Confusion about reading HTML table contents using BeautifulSoup
Problem description
Below is the HTML content:
<table cellspacing="1" cellpadding="0" class="data">
<tr class="colhead">
<th colspan="3">Expression</th>
</tr>
<tr class="colhead">
<th>Task</th>
<th>Action</th>
<th>List</th>
</tr>
<tr class="rowLight">
<td width="40%">
Task1
</td>
<td width="20%">
Assigned to
</td>
<td width="40%">
Harry
</td>
</tr>
<tr class="rowDark">
<td width="40%">
Task2
</td>
<td width="20%">
Rejected by
</td>
<td width="40%">
Lopa
</td>
</tr>
<tr class="rowLight">
<td width="40%">
Task5
</td>
<td width="20%">
Accepted By
</td>
<td width="40%">
Mathew
</td>
</tr>
</table>
Now I need to get the values as shown below. (The table below is simply the Excel table that I will build once I have extracted the values.)
Task   Action       List
Task1  Assigned to  Harry
Task2  Rejected by  Lopa
Task5  Accepted By  Mathew
A layman's solution that I know of is below:
from bs4 import BeautifulSoup
soup = BeautifulSoup(source_URL)
alltables = soup.findAll( "table", {"border":"2", "width":"100%"} )
t = [x for x in soup.findAll('td')]
[x.renderContents().strip('\n') for x in t]
But the HTML content above has no such structure, so how should I approach this? Please guide me here!
Recommended answer
Use .stripped_strings to get the 'interesting' text from a table row:
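# Assumes `soup` is the parsed document from the question.
table = soup.find('table', class_='data')
rows = table.find_all('tr', class_=('rowLight', 'rowDark'))
for row in rows:
    print(list(row.stripped_strings))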
This outputs:
[u'Task1', u'Assigned to', u'Harry']
[u'Task2', u'Rejected by', u'Lopa']
[u'Task5', u'Accepted By', u'Mathew']
Or, to pull everything into one list of lists (with, by request, the last row not included):
data = [list(r.stripped_strings) for r in rows[:-1]]
which becomes:
data = [[u'Task1', u'Assigned to', u'Harry'], [u'Task2', u'Rejected by', u'Lopa']]
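For anyone who wants to try this end to end, here is a minimal self-contained sketch in Python 3. It assumes the markup is already available as a string (the `HTML` constant and the `extract_rows` helper are illustrative names, not part of the original question or answer) and uses the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# The table from the question, with the header rows condensed. In practice the
# markup would come from a file or an HTTP response (not shown here).
HTML = """
<table cellspacing="1" cellpadding="0" class="data">
<tr class="colhead"><th colspan="3">Expression</th></tr>
<tr class="colhead"><th>Task</th><th>Action</th><th>List</th></tr>
<tr class="rowLight">
<td width="40%">
Task1
</td>
<td width="20%">
Assigned to
</td>
<td width="40%">
Harry
</td>
</tr>
<tr class="rowDark">
<td width="40%">
Task2
</td>
<td width="20%">
Rejected by
</td>
<td width="40%">
Lopa
</td>
</tr>
<tr class="rowLight">
<td width="40%">
Task5
</td>
<td width="20%">
Accepted By
</td>
<td width="40%">
Mathew
</td>
</tr>
</table>
"""


def extract_rows(html):
    """Return the text of each data row as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="data")
    # A list passed to class_ matches rows carrying any of the listed classes.
    rows = table.find_all("tr", class_=["rowLight", "rowDark"])
    # stripped_strings skips whitespace-only text and trims the rest.
    return [list(row.stripped_strings) for row in rows]


all_rows = extract_rows(HTML)
for cells in all_rows:
    print(cells)

# Same data without the last row, as in the answer above.
data = all_rows[:-1]
```

Under Python 3 the printed lists contain plain strings without the `u` prefix shown in the Python 2 output above.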
The result of .find_all(), a ResultSet, acts just like a Python list, so you can slice it at will to ignore certain rows, for example.
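As a rough illustration of that list-like behaviour, the sketch below indexes and slices the ResultSet to separate the column titles from the data rows, then writes both out as CSV, which Excel can open. The CSV step is an assumption about how the spreadsheet might be built and is not part of the original answer. The markup is a condensed copy of the question's table, and the variable names are illustrative.

```python
import csv
import io

from bs4 import BeautifulSoup

# Condensed copy of the table from the question.
HTML = """
<table class="data">
<tr class="colhead"><th colspan="3">Expression</th></tr>
<tr class="colhead"><th>Task</th><th>Action</th><th>List</th></tr>
<tr class="rowLight"><td>Task1</td><td>Assigned to</td><td>Harry</td></tr>
<tr class="rowDark"><td>Task2</td><td>Rejected by</td><td>Lopa</td></tr>
<tr class="rowLight"><td>Task5</td><td>Accepted By</td><td>Mathew</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
all_tr = soup.find("table", class_="data").find_all("tr")

# A ResultSet supports ordinary list indexing and slicing.
header = list(all_tr[1].stripped_strings)  # second row holds the column titles
body = [list(tr.stripped_strings) for tr in all_tr[2:]]  # skip both header rows

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(body)
csv_text = buffer.getvalue()
print(csv_text)
```

Slicing by position is brittle if the page layout changes, so filtering by class as in the answer is usually the safer choice when the rows carry distinguishing classes.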