Upper-case HTML tags parsed with lxml

Question

I am parsing an HTML file using lxml.html. The file contains tags in both lower case and upper case. Part of my code is shown below:

        import urllib2
        from lxml import etree

        response = urllib2.urlopen(link)
        html = response.read().decode('cp1251')
        # Parse the decoded page with lxml's HTML parser
        content_html = etree.HTML(html)
        first_link_xpath = content_html.xpath('//TR')
        print (first_link_xpath)

A small part of my HTML file is shown below:

<TR>
    <TR vAlign="top" align="left">
        <!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
        <TD></TD>
    </TR>
 </TR>

When I run the code above on this HTML sample, it returns an empty list. I then tried running first_link_xpath = content_html.xpath('//tr/node()'), and all the upper-case tags were represented as '\r\n\t\t\t\t' in the output. What is the reason behind this?

Note: If the question is unclear, please let me know so I can revise it.

Recommended answer

To follow up on unutbu's answer, I suggest you compare the lxml XML and HTML parsers, especially how they represent documents, by asking for a serialization of the tree with lxml.etree.tostring(). You can see the differences in tags, tag case and hierarchy (which may differ from what a human would expect ;)

$ python
>>> import lxml.etree
>>> doc = """<TR>
...     <TR vAlign="top" align="left">
...         <!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
...         <TD></TD>
...     </TR>
...  </TR>"""
>>> xmldoc = lxml.etree.fromstring(doc)
>>> xmldoc
<Element TR at 0x1e79b90>
>>> htmldoc = lxml.etree.HTML(doc)
>>> htmldoc
<Element html at 0x1f0baa0>
>>> lxml.etree.tostring(xmldoc)
'<TR>\n    <TR vAlign="top" align="left">\n        <!--<TD><B  onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>-->\n        <TD/>\n    </TR>\n </TR>'
>>> lxml.etree.tostring(htmldoc)
'<html><body><tr/><tr valign="top" align="left"><!--<TD><B  onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>--><td/>\n    </tr></body></html>'
>>> 

You can see that the HTML parser created enclosing html and body tags, lower-cased the tag and attribute names, and left an empty tr node at the beginning, since in HTML a tr cannot be nested directly inside another tr (the HTML fragment you provided is broken, either through a typo or because the original document is itself broken).

Then, again as suggested by unutbu, you can try out the different XPath expressions:

>>> xmldoc.xpath('//tr')
[]
>>> xmldoc.xpath('//TR')
[<Element TR at 0x1e79b90>, <Element TR at 0x1f0baf0>]
>>> xmldoc.xpath('//TR/node()')
['\n    ', <Element TR at 0x1f0baf0>, '\n        ', <!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, '\n        ', <Element TD at 0x1f0bb40>, '\n    ', '\n ']
>>> 
>>> htmldoc.xpath('//tr')
[<Element tr at 0x1f0bbe0>, <Element tr at 0x1f0bc30>]
>>> htmldoc.xpath('//TR')
[]
>>> htmldoc.xpath('//tr/node()')
[<!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, <Element td at 0x1f0bbe0>, '\n    ']
>>> 

And indeed, as unutbu stressed, for HTML, XPath expressions should use lower-case tag names to select elements.
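As a compact way to check this yourself, here is a minimal sketch written for Python 3 (the session above is Python 2). The fragment and the variable names are made up for illustration; only lxml.etree.HTML, lxml.etree.fromstring and xpath() come from the discussion above.

```python
from lxml import etree

# Hypothetical fragment with upper-case tags, similar to the question's input.
fragment = '<TABLE><TR vAlign="top"><TD>cell</TD></TR></TABLE>'

# The HTML parser normalizes element and attribute names to lower case,
# so only the lower-case XPath expression matches.
html_root = etree.HTML(fragment)
lower_hits = html_root.xpath('//tr')
upper_hits = html_root.xpath('//TR')

# The XML parser keeps names exactly as written, so matching is case-sensitive.
xml_root = etree.fromstring(fragment)
xml_upper_hits = xml_root.xpath('//TR')

print(len(lower_hits), len(upper_hits), len(xml_upper_hits))
print(lower_hits[0].tag, sorted(lower_hits[0].attrib))
```

If the input really is HTML, keeping etree.HTML and writing the XPath in lower case is probably the simpler route, because the XML parser would reject most real-world pages as not well-formed.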

To me, the '\r\n\t\t\t\t' output is not an error; it is simply the whitespace between the various tr and td tags. For text content, if you don't want this whitespace, you can use lxml.etree.tostring(element, method="text", encoding=unicode).strip(), where element comes from an XPath query, for example (this removes leading and trailing whitespace). Note that the method argument is important: by default, tostring() outputs the markup representation shown above.

>>> map(lambda element: lxml.etree.tostring(element, method="text", encoding=unicode), htmldoc.xpath('//tr'))
[u'', u'\n    ']
>>> 

And you can verify that the text representation is all whitespace.
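For readers on Python 3, the same idea can be sketched as follows. This is an illustrative variant, not the original poster's code: the fragment is hypothetical, encoding="unicode" takes the place of the Python 2 unicode type, and the list comprehension is one possible way to drop whitespace-only text nodes from a node() result. Which whitespace nodes the HTML parser keeps can vary with the libxml2 version, so the filter does not assume any particular layout.

```python
from lxml import etree

# Hypothetical fragment; indentation between tags may show up as text nodes.
fragment = '<table><tr>\n\t<td> Meta Data </td>\n</tr></table>'
root = etree.HTML(fragment)
row = root.xpath('//tr')[0]

# node() returns elements, comments and text nodes alike.
# Text nodes are str subclasses, so blank ones can be filtered out.
nodes = row.xpath('node()')
meaningful = [n for n in nodes if not (isinstance(n, str) and not n.strip())]

# method="text" concatenates the text content of the subtree;
# encoding="unicode" makes tostring() return str instead of bytes.
row_text = etree.tostring(row, method="text", encoding="unicode").strip()

print(meaningful)
print(repr(row_text))
```

Note that strip() only trims the ends of the string; whitespace between cells inside the row is left in place.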
