What should I do when <tr> has rowspan?


Question

If a row contains a cell with a rowspan attribute, how can I make the parsed rows correspond to the table as it appears on the Wikipedia page?

from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
req = Request(wiki, headers=header)
page = urlopen(req)
soup = BeautifulSoup(page, 'lxml')

# find_all() returns a list, so a missing table shows up as an IndexError
try:
    table = soup.find_all('table')[6]
except IndexError:
    raise SystemExit('No tables found, exiting')

rows = table.find_all('tr')
if not rows:
    raise SystemExit('No table row found, exiting')

first = rows[0]
allRows = rows[1:-1]

headers = [cell.get_text() for cell in first.find_all(['th', 'td'])]
results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]

df = pd.DataFrame(data=results, columns=headers)
print(df)

I get the table as output, but for tables where a row contains a rowspan, the rows no longer correspond to the table on the page.

Recommended answer

The problem is caused by the following case.

HTML content:

<tr>
     <td rowspan="2">2=</td>
     <td>West Indies</td>
     <td>4</td>
     <td>Lord's</td>
     <td>2009</td>
</tr>
<tr>
     <td style="text-align:left;">India</td>
     <td>4</td>
     <td>Mumbai</td>
     <td>2012</td>
</tr>

So when a td has a rowspan attribute, treat that td's value as repeated at the same position in the following tr tags. The rowspan value is the number of tr rows the cell covers, counting the row in which it appears.

  1. Collect all such rowspan information and save it in a variable: the sequence number of the tr tag, the sequence number of the td tag within it, the rowspan value (how many tr tags share the same td), and the text value of the td.
  2. Update the results of all affected tr rows according to the method above.
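The two steps above can also be sketched on plain Python data, without any network access. This is a minimal illustration, not the code from this answer: the function name expand_rowspans and the (text, rowspan) tuple format are assumptions made for the example. It tracks spanned cells by column rather than inserting into finished rows.

```python
def expand_rowspans(rows):
    """rows: list of rows, each a list of (text, rowspan) tuples.

    Returns rows of plain text in which every spanned cell is repeated
    in the rows it covers. With BeautifulSoup, one row could be built as
    [(td.get_text(), int(td.get("rowspan", 1))) for td in tr.find_all("td")].
    """
    pending = {}  # column index -> (rows still to fill, text)
    result = []
    for cells in rows:
        queue = list(cells)
        out = []
        col = 0
        # Continue while this row has cells left or a carried-over cell
        # still has to be placed at this column or further right.
        while queue or any(c >= col for c in pending):
            if col in pending:
                remaining, text = pending[col]
                out.append(text)
                if remaining == 1:
                    del pending[col]
                else:
                    pending[col] = (remaining - 1, text)
            elif queue:
                text, span = queue.pop(0)
                out.append(text)
                if span > 1:
                    pending[col] = (span - 1, text)
            else:
                out.append("")  # gap in a malformed row
            col += 1
        result.append(out)
    return result


rows = [
    [("2=", 2), ("West Indies", 1), ("4", 1), ("Lord's", 1), ("2009", 1)],
    [("India", 1), ("4", 1), ("Mumbai", 1), ("2012", 1)],
]
for row in expand_rowspans(rows):
    print(row)
```

For the HTML fragment shown earlier, the second row gains the carried-over "2=" in its first column, so both rows end up with five cells.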

Note: this was checked only against the given test case. More test cases need to be checked.
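One further case that may be worth checking is a row that both receives a carried-over cell and starts a rowspan of its own. The sketch below mirrors the insert-based update on plain data. The name fill_by_insert and the (text, rowspan) tuple format are assumptions made for this illustration, not part of the answer's code. Because each td is recorded by its position within its own tr rather than by its column in the table, the inserted values can land in the wrong order.

```python
def fill_by_insert(rows):
    """rows: list of rows, each a list of (text, rowspan) tuples."""
    results = [[text for text, _ in row] for row in rows]
    # (tr index, td index within that tr, rowspan value, text)
    spans = [
        (r, c, span, text)
        for r, row in enumerate(rows)
        for c, (text, span) in enumerate(row)
        if span > 1
    ]
    for r, c, span, text in spans:
        for j in range(1, span):
            results[r + j].insert(c, text)
    return results


# "A" spans three rows; "B" starts in the second row and spans two rows.
rows = [
    [("A", 3), ("x", 1), ("y", 1)],
    [("B", 2), ("z", 1)],
    [("w", 1)],
]
for row in fill_by_insert(rows):
    print(row)
```

Here the third row comes out as B, A, w, whereas the table layout implies A, B, w. Tracking spans by column index instead of td position is one way to avoid this.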

Code:

from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup

wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
req = Request(wiki, headers=header)
page = urlopen(req)

soup = BeautifulSoup(page, 'lxml')

table = soup.find_all('table')[6]

tmp = table.find_all('tr')

first = tmp[0]       # header row
allRows = tmp[1:-1]  # data rows; the last row is skipped, as in the question

headers = [th.get_text() for th in first.find_all('th')]

results = [[data.get_text() for data in row.find_all('td')] for row in allRows]

# <td rowspan="2">2=</td>
# List of tuples: (tr index, td index within that tr, rowspan value, text value)
# e.g.
# [(1, 0, 2, '2=')]
# (tr is 1, td sequence in tr is 0, repeated 2 times, value is 2=)
rowspan = []

for no, tr in enumerate(allRows):
    for td_no, data in enumerate(tr.find_all('td')):
        if data.has_attr("rowspan"):
            rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))

for i in rowspan:
    # The row that declares the rowspan already holds the value in results,
    # so start from the next row.
    for j in range(1, i[2]):
        # Add the value to the following tr.
        results[i[0] + j].insert(i[1], i[3])

df = pd.DataFrame(data=results, columns=headers)
print(df)

Output:

  Rank       Opponent No. wins Most recent venue Season
0    1   South Africa        6            Lord's   1951
1   2=    West Indies        4            Lord's   2009
2   2=          India        4            Mumbai   2012
3    4      Australia        3            Sydney   1932
4    5       Pakistan        2      Trent Bridge   1967
5    6      Sri Lanka        1      Old Trafford   2002

This also works for table 10:

  Rank Hundreds            Player Matches Innings Average
0    1       25     Alastair Cook     107     191   45.61
1    2       23   Kevin Pietersen     104     181   47.28
2    3       22     Colin Cowdrey     114     188   44.07
3    3       22     Wally Hammond      85     140   58.46
4    3       22  Geoffrey Boycott     108     193   47.72
5    6       21    Andrew Strauss     100     178   40.91
6    6       21          Ian Bell     103     178   45.30
7   8=       20    Ken Barrington      82     131   58.67
8   8=       20      Graham Gooch     118     215   42.58
9   10       19        Len Hutton      79     138   56.67
