Best way to convert this HTML file into an XML file using Python


Problem description


This is the HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><META http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>

    <div bgcolor="#48486c">

        <table width="720" border="0" cellspacing="0" cellpadding="0" align="center" background="http://title.jpg" height="130">

            <tr height="129">

                <td width="719" height="129"></td>

                <td width="1" height="129"></td>

            </tr>

            <tr height="1">

                <td width="720" height="1"></td>

                <td width="1" height="1"></td>

            </tr>

        </table>

        <table width="720" border="0" cellspacing="0" cellpadding="0" align="center" height="203">

            <tr height="20">

                <td width="719" height="20"></td>

                <td width="1" height="20"></td>

            </tr>

            <tr height="69">

                <td width="719" height="69" valign="top" align="left">

                    <table width="719" border="1" cellspacing="2" cellpadding="0">

                        <tr>

                            <td bgcolor="a5fdf8" width="390"><b>Stream Name</b></td>

                            <td bgcolor="a5fdf8" width="61"><b>Status</b></td>

                            <td bgcolor="a5fdf8" width="61"><b>Duration</b></td>

                            <td bgcolor="a5fdf8" width="185"><b>Start</b></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="390">c:\streams\ours\Sony_AVCHD_<WBR>Test_Discs_60Hz_00001.m2ts</td>

                            <td width="61"><font color="#D0D0D0">----</font></td>

                            <td width="61">00:00:02</td>

                            <td width="185">2010/06/15-15:06:17</td>

                        </tr>

                    </table>

                </td>

                <td width="1" height="69"></td>

            </tr>

            <tr height="113">

                <td width="720" height="113" colspan="2" valign="top" align="left">

                    <table width="721" border="1" cellspacing="2" cellpadding="0">

                        <tr bgcolor="a5fdf8">

                            <td width="299"><b>Test Category</b></td>

                            <td width="61"><b>Error</b></td>

                            <td width="62"><b>Warning</b></td>

                            <td width="275"><b>Details</b></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">All Tests (Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts)</font></td>

                            <td width="61"><font color="#ff0000">34787</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">  ETSI TR-101-290 Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">  ISO/IEC Transport Stream Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">  System Data T-STD Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">  Prog(1)</font></td>

                            <td width="61"><font color="#ff0000">34787</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">    VES(0xe0)</font></td>

                            <td width="61"><font color="#ff0000">34787</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#1010F0">      H.264/AVC Conformance</font></td>

                            <td width="61"><font color="#ff0000">34718</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275">

                                <a><font color="#ff0000">Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts_Prog(1)_PID(0x1011)<WBR>_H264_Conf.txt</font></a><br>

                            </td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Sequence</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Picture</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Slice</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Macroblock</font></td>

                            <td width="61"><font color="#ff0000">34718</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Block</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#1010F0">      HRD Tests</font></td>

                            <td width="61"><font color="#ff0000">69</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275">

                                <a><font color="#ff0000">Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts_Prog(1)_PID(0x1011)<WBR>_H264_HRD.txt</font></a><br>

                            </td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        HRD level</font></td>

                            <td width="61"><font color="#ff0000">69</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">      Video T-STD Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">    AES(0xfd)</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#808080">      Audio Level Tests</font></td>

                            <td width="61"><font color="#808080">Disabled</font></td>

                            <td width="61"><font color="#808080">Disabled</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">      Audio T-STD Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                    </table>

                </td>

            </tr>

            <tr height="1">

                <td width="719" height="1"></td>

                <td width="1" height="1"></td>

            </tr>

        </table>

    </div>



</body></html>

Is there a Python library that can do this?

Thanks.

Solution

BeautifulSoup gets you almost all the way there:

>>> import BeautifulSoup
>>> f = open('a.html')
>>> soup = BeautifulSoup.BeautifulSoup(f)
>>> f.close()
>>> g = open('a.xml', 'w')
>>> print >> g, soup.prettify()
>>> g.close()

This closes all tags properly. The only remaining issue is that the doctype is still HTML. To change it to the doctype of your choice, you only need to replace the first line, which is not hard. For example, instead of printing the prettified text directly:

>>> lines = soup.prettify().splitlines()
>>> lines[0] = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
                '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
>>> print >> g, '\n'.join(lines)
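
The transcripts above use Python 2 syntax (`print >> g`). For readers on Python 3, here is a minimal sketch of the same idea using only the standard library. It is an alternative technique, not the BeautifulSoup approach from the answer: `html.parser.HTMLParser` events are fed into an `xml.etree.ElementTree.TreeBuilder`, void elements such as `<br>`, `<WBR>` and `<META>` are closed immediately, and any tags still open are closed when their parent closes. The names `TreeBuildingParser`, `html_to_xml`, `VOID_TAGS` and `XHTML_DOCTYPE` are hypothetical, and the recovery rules are deliberately crude, so the output is well-formed XML but not necessarily valid XHTML.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Same doctype as in the answer; note the space between the two quoted ids.
XHTML_DOCTYPE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
                 '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')


class TreeBuildingParser(HTMLParser):
    """Hypothetical helper: feeds HTMLParser events into an ElementTree builder."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # tags started but not yet closed

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases names; valueless attributes arrive as None,
        # but XML requires every attribute to have a value.
        attributes = {name: (name if value is None else value)
                      for name, value in attrs}
        self.builder.start(tag, attributes)
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or void closing tag: ignore it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # text outside the root element is dropped
            self.builder.data(data)

    def finish(self):
        """Flush the parser, close any tags left open, return the root element."""
        self.close()
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_xml(html_text, doctype=XHTML_DOCTYPE):
    """Return html_text as a well-formed XML string preceded by doctype."""
    parser = TreeBuildingParser()
    parser.feed(html_text)
    root = parser.finish()
    return doctype + "\n" + ET.tostring(root, encoding="unicode")
```

To mirror the transcript, one would read `a.html` into a string, pass it to `html_to_xml`, and write the result to `a.xml`. The original HTML doctype and any comments are discarded by this sketch, which is why the replacement doctype is simply prepended rather than swapped into the first line.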
