BeautifulSoup HTML table parsing

Views: 222 | Published: 2016/8/5 18:56:45 | Tags: python, table, beautifulsoup, mechanize, html-parsing

Problem description

I am trying to parse information (html tables) from this site: http://www.511virginia.org/RoadConditions.aspx?j=All&r=1

Currently I am using BeautifulSoup, and the code I have looks like this:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

# Fetch the page with mechanize
mech = Browser()
url = "http://www.511virginia.org/RoadConditions.aspx?j=All&r=1"
page = mech.open(url)
html = page.read()

# Parse the HTML and take the fourth row of the first table
soup = BeautifulSoup(html)
table = soup.find("table")
rows = table.findAll('tr')[3]
cols = rows.findAll('td')

# Read each cell's text via .string
roadtype = cols[0].string
start = cols[1].string
end = cols[2].string
condition = cols[3].string
reason = cols[4].string
update = cols[5].string

entry = (roadtype, start, end, condition, reason, update)

print entry

The issue is with the start and end columns: they are printed as None.

Output:

(u'Rt. 613N (Giles County)', None, None, u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

I know that the cells are stored in the cols list, but the extra link tag seems to be interfering with the parsing. The original HTML looks like this:

<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>

So what should be printed is:

(u'Rt. 613N (Giles County)', u'Big Stony Ck Rd; Rt. 635E/W (Giles County)', u'Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)', u'Moderate', u'snow or ice', u'01/13/2010 10:50 AM')

Any suggestions or help is appreciated, and thank you in advance.

Solution

start = cols[1].find('a').string

or simpler

start = cols[1].a.string

or better

start = str(cols[1].find(text=True))

and

entry = [str(col.find(text=True)) for col in cols]
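
For readers on the newer bs4 package rather than the BeautifulSoup 3 import used in the question, the same idea can be sketched as below. This is only an illustration under stated assumptions: bs4 is installed, the sample row is copied from the question and wrapped in table and tr tags, and ROW_HTML and row_to_entry are hypothetical names that do not appear in the original code.

```python
from bs4 import BeautifulSoup

# Sample row from the question, wrapped in <table><tr> (an assumption)
# so that it can be parsed on its own without fetching the live page.
ROW_HTML = """
<table><tr>
<td headers="road-type" class="ConditionsCellText">Rt. 613N (Giles County)</td>
<td headers="start" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Big Stony Ck Rd; Rt. 635E/W (Giles County)</a></td>
<td headers="end" class="ConditionsCellText"><a href="conditions.aspx?lat=37.43036753&long=-80.51118005#viewmap">Cabin Ln; Rocky Mount Rd; Rt. 721E/W (Giles County)</a></td>
<td headers="condition" class="ConditionsCellText">Moderate</td>
<td headers="reason" class="ConditionsCellText">snow or ice</td>
<td headers="update" class="ConditionsCellText">01/13/2010 10:50 AM</td>
</tr></table>
"""


def row_to_entry(row):
    """Hypothetical helper: return the text of every <td> in a row."""
    # get_text() collects all text nodes under the cell, so it does not
    # matter whether the text sits directly in the <td> or inside an <a>.
    return tuple(cell.get_text(strip=True) for cell in row.find_all("td"))


soup = BeautifulSoup(ROW_HTML, "html.parser")
entry = row_to_entry(soup.find("tr"))
print(entry)
```

In bs4, .string already looks through a single nested tag, so the None result described in the question should be specific to the older BeautifulSoup 3 behaviour. get_text() has the further advantage of also covering cells that contain several child nodes.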
