What should I do when <tr> has rowspan

Views: 299 Posted: 2018/6/15 12:18:17 Tags: python html pandas beautifulsoup


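Problem description

If a row contains a cell with a rowspan attribute, how can I make the rows line up with the table as it appears on the Wikipedia page?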

    from bs4 import BeautifulSoup
    import urllib2
    from lxml.html import fromstring
    import re
    import csv
    import pandas as pd

    wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
    header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
    req = urllib2.Request(wiki, headers=header)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    try:
        table = soup.find_all('table')[6]
    except AttributeError as e:
        print 'No tables found, exiting'

    try:
        first = table.find_all('tr')[0]
    except AttributeError as e:
        print 'No table row found, exiting'

    try:
        allRows = table.find_all('tr')[1:-1]
    except AttributeError as e:
        print 'No table row found, exiting'

    headers = [header.get_text() for header in first.find_all(['th', 'td'])]
    results = [[data.get_text() for data in row.find_all(['th', 'td'])] for row in allRows]

    df = pd.DataFrame(data=results, columns=headers)
    df

I get the table as output, but for tables where a row contains a rowspan cell, I get the table as follows:

Solution

The problem is due to the following case. As you know, the HTML content is:

    <tr>
    <td rowspan="2">2=</td>
    <td>West Indies</td>
    <td>4</td>
    <td>Lord's</td>
    <td>2009</td>
    </tr>
    <tr>
    <td style="text-align:left;">India</td>
    <td>4</td>
    <td>Mumbai</td>
    <td>2012</td>
    </tr>

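So when a td has a rowspan attribute, treat that td's value as repeated at the same position in the following tr tags. The rowspan value is the total number of tr tags the cell covers, so the value is repeated in the next rowspan - 1 rows.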
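To make the effect concrete, here is a minimal sketch that counts the cells each tr actually contains in the snippet above. It is written for Python 3 and uses only the standard library's html.parser rather than the BeautifulSoup used in this post; CellCollector and SAMPLE are hypothetical names introduced for this illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect (text, rowspan) pairs for every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of (text, rowspan) pairs per <tr>
        self._span = None   # rowspan of the cell being read, or None outside a cell
        self._text = []     # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:           # tolerate a cell that appears before any <tr>
                self.rows.append([])
            # A missing rowspan attribute means the cell covers one row.
            self._span = int(dict(attrs).get("rowspan", 1))
            self._text = []

    def handle_data(self, data):
        if self._span is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span is not None:
            self.rows[-1].append(("".join(self._text).strip(), self._span))
            self._span = None


SAMPLE = """
<table>
<tr><td rowspan="2">2=</td><td>West Indies</td><td>4</td><td>Lord's</td><td>2009</td></tr>
<tr><td style="text-align:left;">India</td><td>4</td><td>Mumbai</td><td>2012</td></tr>
</table>
"""

parser = CellCollector()
parser.feed(SAMPLE)
for row in parser.rows:
    print(len(row), row)
```

The first row reports five cells and the second only four, because the "2=" cell is written once in the markup but belongs to both rows. That missing cell is what shifts the second row's values one column to the left.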

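Collect all such rowspan information and save it in a variable: the sequence number of the tr tag, the sequence number of the td tag within it, the rowspan value (i.e. how many tr tags share the same td), and the text value of the td.
Then update the results for all affected tr rows according to the method above.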

Note: checked only against the given test case. More test cases need to be checked.
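The bookkeeping described above can also be sketched independently of any HTML library. The version below is a Python 3 illustration, not the code from this answer: expand_rowspans is a hypothetical helper, and its input shape (each row as a list of (text, rowspan) pairs) is an assumption made for the example. It tracks carried-over cells by column index instead of by td position.

```python
def expand_rowspans(rows):
    """Return a grid of strings in which spanned cells are repeated.

    `rows` is a list of rows; each row is a list of (text, rowspan) pairs
    in document order, i.e. only the cells that row actually contains.
    """
    grid = []
    pending = {}  # column index -> (text, number of further rows to fill)
    for row in rows:
        own = list(row)   # cells written in this row's markup
        out = []
        col = 0
        # Keep going while this row has cells left, or a carried-over cell
        # still sits at this column or further to the right.
        while own or any(c >= col for c in pending):
            if col in pending:
                # Column is covered by a cell from an earlier row.
                text, remaining = pending.pop(col)
                if remaining > 1:
                    pending[col] = (text, remaining - 1)
            elif own:
                text, span = own.pop(0)
                if span > 1:
                    pending[col] = (text, span - 1)
            else:
                text = ""  # malformed table: gap before a carried-over cell
            out.append(text)
            col += 1
        grid.append(out)
    return grid


sample_rows = [
    [("1", 1), ("South Africa", 1), ("6", 1), ("Lord's", 1), ("1951", 1)],
    [("2=", 2), ("West Indies", 1), ("4", 1), ("Lord's", 1), ("2009", 1)],
    [("India", 1), ("4", 1), ("Mumbai", 1), ("2012", 1)],
]

for line in expand_rowspans(sample_rows):
    print(line)
```

Tracking by column is a design choice rather than a requirement. A td's position within its own row can differ from its column index once an earlier row's span has shifted it, and that situation does not occur in the given test case, so it may be one of the additional cases worth checking.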

Code:

    from bs4 import BeautifulSoup
    import urllib2
    from lxml.html import fromstring
    import re
    import csv
    import pandas as pd


    wiki = "http://en.wikipedia.org/wiki/List_of_England_Test_cricket_records"
    header = {'User-Agent': 'Mozilla/5.0'}  # Needed to prevent 403 error on Wikipedia
    req = urllib2.Request(wiki, headers=header)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    table = soup.find_all('table')[6]

    tmp = table.find_all('tr')

    first = tmp[0]
    allRows = tmp[1:-1]
    # table.find_all('tr')[1:-1]

    headers = [header.get_text() for header in first.find_all('th')]

    results = [[data.get_text() for data in row.find_all('td')] for row in allRows]

    # <td rowspan="2">2=</td>
    # list of tuples (level of tr, level of td, total count, text value)
    # e.g.
    # [(1, 0, 2, u'2=')]
    # (<tr> is 1, td sequence in tr is 0, repeated 2 times, value is 2=)
    rowspan = []

    for no, tr in enumerate(allRows):
        tmp = []
        for td_no, data in enumerate(tr.find_all('td')):
            print data.has_key("rowspan")
            if data.has_key("rowspan"):
                rowspan.append((no, td_no, int(data["rowspan"]), data.get_text()))

    if rowspan:
        for i in rowspan:
            # the tr that holds the rowspan cell already has the value in results
            for j in xrange(1, i[2]):
                # add the value to the following tr rows
                results[i[0]+j].insert(i[1], i[3])

    df = pd.DataFrame(data=results, columns=headers)
    print df

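Output:

       Rank      Opponent No. wins Most recent venue Season
    0     1  South Africa        6            Lord's   1951
    1    2=   West Indies        4            Lord's   2009
    2    2=         India        4            Mumbai   2012
    3     4     Australia        3            Sydney   1932
    4     5      Pakistan        2      Trent Bridge   1967
    5     6     Sri Lanka        1      Old Trafford   2002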

It also works for table 10:

       Rank Hundreds            Player Matches Innings Average
    0     1       25     Alastair Cook     107     191   45.61
    1     2       23   Kevin Pietersen     104     181   47.28
    2     3       22     Colin Cowdrey     114     188   44.07
    3     3       22     Wally Hammond      85     140   58.46
    4     3       22  Geoffrey Boycott     108     193   47.72
    5     6       21    Andrew Strauss     100     178   40.91
    6     6       21          Ian Bell     103     178   45.30
    7    8=       20    Ken Barrington      82     131   58.67
    8    8=       20      Graham Gooch     118     215   42.58
    9    10       19        Len Hutton      79     138   56.67
