How to parse an HTML file with a table using Python


Question

I have an HTML file containing a table (it is a large one, so only sample code is given). I want to retrieve the values in the table. I tried the HTMLParser library from Python.

I started coding as shown below. Then I found that the attribute name "class" is also a reserved keyword in Python, so the code gives me an error.

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            for class in attrs:
                if class == 'Table_row'

p = MyHTMLParser()
p.feed(ht)   

HTML code of the table

<table class="Table_rows" cellspacing="0" rules="all" border="1" id="MyDataGrid" style="width:700px;border-collapse:collapse;">

                    <tr class="Table_Heading">

                        <td>STATION CODE</td><td>STATION NAME</td><td>SCHEDULED ARRIVAL</td><td>SCHEDULED DEPARTURE</td><td>ACTUAL/ EXPECTED ARRIVAL</td><td>ACTUAL/ EXPECTED DEPARTURE</td>

                    </tr><tr class="Table_row">

                        <td>TVC </td><td style="width:160px;">ORIGON</td><td>Starting Station </td><td>05:00, 07 May 2011</td><td>Starting Station</td><td>05:00, 07 May 2011</td>

                    </tr><tr class="alternat_table_row">

                        <td>TVP </td><td>NEY YORK</td><td>05:04, 07 May 2011</td><td>05:05, 07 May 2011</td><td>05:04, 07 May 2011</td><td>05:05, 07 May 2011</td>

</tr>               
</table>

Update

How can I get the data between the tags?

Recommended answer

Note that the documentation of the handle_starttag method states:

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets.

So you are probably looking for something like:

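from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so unpack each pair
        # instead of using the reserved word `class` as a variable name
        if tag == 'tr':
            for name, value in attrs:
                if name == 'class':
                    print('Found class', value)

p = MyHTMLParser()
p.feed(ht)  # ht is the HTML string shown in the question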

This prints:

Found class Table_Heading
Found class Table_row
Found class alternat_table_row
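
The update in the question asks how to get the data between the tags, which the snippet above does not cover. One possible approach with the same standard-library parser is to also override handle_data and handle_endtag, tracking when the parser is inside a <td>. The following is only a minimal sketch, assuming Python 3; the names TableParser, rows and sample are made up for illustration, and sample is a shortened copy of the table from the question.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears while a <td> is open
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Shortened, hypothetical copy of the table from the question
sample = """
<table class="Table_rows" id="MyDataGrid">
<tr class="Table_Heading"><td>STATION CODE</td><td>STATION NAME</td></tr>
<tr class="Table_row"><td>TVC </td><td style="width:160px;">ORIGON</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample)
parser.close()
for row in parser.rows:
    print(row)
```

After parsing, parser.rows holds one list of stripped cell strings per table row, so the header row and the data rows can be told apart by position. This sketch does not handle nested tables or <th> cells.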


P.S. I also recommend BeautifulSoup for parsing HTML with Python.
