What should I do when <tr> has rowspan?
Problem description
If a row contains a cell with a rowspan attribute, how can I make the parsed rows correspond to the table as it appears on the Wikipedia page?
from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring
import re
import csv
import pandas as pd
wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
try:
    table = soup.find_all('table')[6]
except AttributeError as e:
    print 'No tables found, exiting'
try:
    first = table.find_all('tr')[0]
except AttributeError as e:
    print 'No table row found, exiting'
try:
    allRows = table.find_all('tr')[1:-1]
except AttributeError as e:
    print 'No table row found, exiting'
headers = [header.get_text() for header in first.find_all(['th', 'td'])]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]
df = pd.DataFrame(data=results, columns=headers)
df
I get a table as output, but when a row contains a cell with rowspan, the table I get does not correspond to the one on the page.
Recommended answer
The problem is caused by the following case.
HTML content:
<tr>
<td rowspan="2">2=</td>
<td>West Indies</td>
<td>4</td>
<td>Lord's</td>
<td>2009</td>
</tr>
<tr>
<td style="text-align:left;">India</td>
<td>4</td>
<td>Mumbai</td>
<td>2012</td>
</tr>
So when a td has a rowspan attribute, treat that td's value as repeated at the same position in the following tr tags. The rowspan value is the total number of tr tags the cell spans, including the one it appears in.
- Collect all such rowspan information and save it in a variable: the sequence number of the tr tag, the sequence number of the td tag within it, the rowspan value (i.e. how many tr tags share the same td), and the text value of the td.
- Update the results for all affected tr rows accordingly.
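The steps above can also be sketched without any network access or HTML parser. This is only an illustration under stated assumptions, not the code used in this answer: the helper name `expand_rowspans` is hypothetical, and each row is assumed to have already been reduced to `(text, rowspan)` pairs (with BeautifulSoup, something like `int(td.get("rowspan", 1))` could supply the second element). Instead of inserting by td index, it tracks which columns are still occupied by a cell from a row above, which should also cope with several rowspan cells overlapping in different columns.

```python
def expand_rowspans(rows):
    """Expand rowspan cells into full rows.

    rows: list of rows, each a list of (text, rowspan) pairs in document
    order, where rowspan is 1 for an ordinary cell (assumed input format).
    """
    grid = []
    pending = {}  # column index -> (text, rows still to be filled below)
    for cells in rows:
        cells = list(cells)  # copy so the caller's data is not modified
        out = []
        col = 0
        # Continue while this row still has its own cells, or a cell
        # carried down from a row above occupies the current column.
        while cells or col in pending:
            if col in pending:
                text, remaining = pending[col]
                out.append(text)
                if remaining > 1:
                    pending[col] = (text, remaining - 1)
                else:
                    del pending[col]
            else:
                text, span = cells.pop(0)
                out.append(text)
                if span > 1:
                    # The cell also covers the next span - 1 rows.
                    pending[col] = (text, span - 1)
            col += 1
        grid.append(out)
    return grid


# The two rows from the HTML content above, as (text, rowspan) pairs.
rows = [
    [("2=", 2), ("West Indies", 1), ("4", 1), ("Lord's", 1), ("2009", 1)],
    [("India", 1), ("4", 1), ("Mumbai", 1), ("2012", 1)],
]
for row in expand_rowspans(rows):
    print(row)
```

The second row comes back with "2=" in its first position, so every row has the same number of columns and can be passed to pd.DataFrame together with the headers. The sketch uses only syntax that runs under both Python 2 and Python 3.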
Note: this was only checked against the given test case. More test cases need to be checked.
Code:
from bs4 import BeautifulSoup
import urllib2
from lxml.html import fromstring
import re
import csv
import pandas as pd
wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
table = soup.find_all('table')[6]
tmp = table.find_all('tr')
first = tmp[0]
allRows = tmp[1:-1]
#table.find_all('tr')[1:-1]
headers = [header.get_text() for header in first.find_all('th')]
results = [[data.get_text() for data in row.find_all('td')] for row in allRows]
# Example cell: <td rowspan="2">2=</td>
# rowspan is a list of tuples:
#   (index of tr, index of td within the tr, rowspan value, text value)
# e.g. [(1, 0, 2, u'2=')]
#   (tr index is 1, td index within the tr is 0, spans 2 rows, value is "2=")
rowspan = []
for no, tr in enumerate(allRows):
    for td_no, data in enumerate(tr.find_all('td')):
        # has_attr replaces the deprecated has_key
        if data.has_attr("rowspan"):
            rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))
if rowspan:
    for i in rowspan:
        # The row holding the rowspan cell already has the value in results,
        # so start at offset 1 and fill only the rows below it.
        for j in xrange(1, i[2]):
            # Insert the value at the same td position in the following tr.
            results[i[0]+j].insert(i[1], i[3])
df = pd.DataFrame(data=results, columns=headers)
print df
Output:
   Rank      Opponent  No. wins  Most recent venue  Season
0     1  South Africa         6             Lord's    1951
1    2=   West Indies         4             Lord's    2009
2    2=         India         4             Mumbai    2012
3     4     Australia         3             Sydney    1932
4     5      Pakistan         2       Trent Bridge    1967
5     6     Sri Lanka         1       Old Trafford    2002
This also works for table 10:
   Rank  Hundreds            Player  Matches  Innings  Average
0     1        25     Alastair Cook      107      191    45.61
1     2        23   Kevin Pietersen      104      181    47.28
2     3        22     Colin Cowdrey      114      188    44.07
3     3        22     Wally Hammond       85      140    58.46
4     3        22  Geoffrey Boycott      108      193    47.72
5     6        21    Andrew Strauss      100      178    40.91
6     6        21          Ian Bell      103      178    45.30
7    8=        20    Ken Barrington       82      131    58.67
8    8=        20      Graham Gooch      118      215    42.58
9    10        19        Len Hutton       79      138    56.67