Upper-case HTML tags parsed with lxml
Question
I am parsing an HTML file using lxml.html. The file contains tags in both lower case and upper case. Part of my code is shown below:
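import urllib2
from lxml import etree

response = urllib2.urlopen(link)  # link holds the page URL (not shown)
html = response.read().decode('cp1251')
content_html = etree.HTML(html)  # parse the decoded string
first_link_xpath = content_html.xpath('//TR')
print(first_link_xpath)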
A small part of my HTML file is shown below:
<TR>
<TR vAlign="top" align="left">
<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
<TD></TD>
</TR>
</TR>
When I run the code above on this HTML sample, it returns an empty list. I then tried first_link_xpath = content_html.xpath('//tr/node()') and all the upper-case tags were represented as '\r\n\t\t\t\t' in the output. What is the reason behind this?
Note: If the question is unclear, please let me know and I will revise it.
Recommended answer
To follow up on unutbu's answer, I suggest you compare lxml's XML and HTML parsers, in particular how each one represents the document, by serializing the parsed tree back with lxml.etree.tostring(). You will see differences in the tags, their case, and the hierarchy (which may differ from what a human would expect ;)
$ python
>>> import lxml.etree
>>> doc = """<TR>
... <TR vAlign="top" align="left">
... <!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
... <TD></TD>
... </TR>
... </TR>"""
>>> xmldoc = lxml.etree.fromstring(doc)
>>> xmldoc
<Element TR at 0x1e79b90>
>>> htmldoc = lxml.etree.HTML(doc)
>>> htmldoc
<Element html at 0x1f0baa0>
>>> lxml.etree.tostring(xmldoc)
'<TR>\n <TR vAlign="top" align="left">\n <!--<TD><B onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>-->\n <TD/>\n </TR>\n </TR>'
>>> lxml.etree.tostring(htmldoc)
'<html><body><tr/><tr valign="top" align="left"><!--<TD><B onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>--><td/>\n </tr></body></html>'
>>>
You can see that the HTML parser created enclosing html and body tags and lower-cased the tag and attribute names. There is also an empty tr node at the beginning: in HTML a tr cannot be nested directly inside another tr, so the parser closes the first one when it meets the second. (The HTML fragment you provided is broken, either through a typo or because the original document is itself broken.)
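One way to compare the two hierarchies side by side is to print an indented outline of each tree. The following is a minimal sketch, assuming Python 3 and an installed lxml; outline is a hypothetical helper name introduced here, and the sample leaves out the comment line of the fragment for brevity.

```python
from lxml import etree

# Shortened version of the fragment from the question (comment omitted).
doc = """<TR>
<TR vAlign="top" align="left">
<TD></TD>
</TR>
</TR>"""


def outline(element, depth=0):
    """Return one indented line per element, showing the parsed hierarchy."""
    lines = ["  " * depth + element.tag]
    for child in element:
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(child.tag, str):
            lines.extend(outline(child, depth + 1))
    return lines


xml_lines = outline(etree.fromstring(doc))   # XML parser keeps case and nesting
html_lines = outline(etree.HTML(doc))        # HTML parser normalizes both

print("XML parser:")
print("\n".join(xml_lines))
print("HTML parser:")
print("\n".join(html_lines))
```

The XML outline keeps the upper-case names and the nested TR, while the HTML outline starts at html and contains only lower-case names.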
Then, again as suggested by unutbu, you can try out the different XPath expressions:
>>> xmldoc.xpath('//tr')
[]
>>> xmldoc.xpath('//TR')
[<Element TR at 0x1e79b90>, <Element TR at 0x1f0baf0>]
>>> xmldoc.xpath('//TR/node()')
['\n ', <Element TR at 0x1f0baf0>, '\n ', <!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, '\n ', <Element TD at 0x1f0bb40>, '\n ', '\n ']
>>>
>>> htmldoc.xpath('//tr')
[<Element tr at 0x1f0bbe0>, <Element tr at 0x1f0bc30>]
>>> htmldoc.xpath('//TR')
[]
>>> htmldoc.xpath('//tr/node()')
[<!--<TD><B onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, <Element td at 0x1f0bbe0>, '\n ']
>>>
And indeed, as unutbu stressed, for HTML, XPath expressions should use lower-case tag names to select elements.
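The rule can be checked on a small document. This is a sketch only, assuming Python 3 and an installed lxml; the sample markup and variable names are made up for illustration. It also shows one possible workaround, not taken from the answer, for documents parsed with the XML parser: comparing names case-insensitively with the XPath 1.0 translate() function.

```python
from lxml import etree

doc = "<TABLE><TR vAlign='top'><TD>cell</TD></TR></TABLE>"

# HTML parser: tag and attribute names are lower-cased while parsing.
html_root = etree.HTML(doc)
html_lower = html_root.xpath('//tr')          # matches the row
html_upper = html_root.xpath('//TR')          # matches nothing
html_attr = html_root.xpath('//tr/@valign')   # attribute name is lower-cased too

# XML parser: names keep their original case, so compare case-insensitively.
UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LOWER = 'abcdefghijklmnopqrstuvwxyz'
xml_root = etree.fromstring(doc)
xml_any_case = xml_root.xpath(
    "//*[translate(local-name(), $upper, $lower) = 'tr']",
    upper=UPPER, lower=LOWER)

print(html_lower, html_upper, html_attr)
print(xml_any_case)
```

With the HTML parser the simpler fix is to write the expression in lower case; the translate() form is only worth its verbosity when the XML parser has to be used.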
To me, the '\r\n\t\t\t\t' output is not an error; it is simply the whitespace between the various tr and td tags. If you do not want this whitespace in the text content, you can use lxml.etree.tostring(element, method="text", encoding=unicode).strip(), where element comes from an XPath result, for example. (This removes leading and trailing whitespace only.)
Note that the method argument is important: by default, tostring() outputs the serialized markup, as in the tests above.
>>> map(lambda element: lxml.etree.tostring(element, method="text", encoding=unicode), htmldoc.xpath('//tr'))
[u'', u'\n ']
>>>
You can verify that the text representation is all whitespace.
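For reference, here is a sketch of the same idea under Python 3, where the unicode builtin no longer exists and lxml accepts encoding="unicode" instead. The sample markup and variable names are assumptions made for illustration. It also shows two related options: filtering whitespace-only text nodes out of a node() result, and the XPath normalize-space() function.

```python
from lxml import etree

# Hypothetical sample with the same kind of whitespace between tags.
doc = "<table><tr>\r\n\t\t\t\t<td> Meta Data </td>\r\n\t\t\t\t</tr></table>"
root = etree.HTML(doc)
tr = root.xpath('//tr')[0]

# Text serialization, stripped of leading and trailing whitespace.
text = etree.tostring(tr, method="text", encoding="unicode").strip()

# Drop text nodes that consist of whitespace only; keep everything else.
kept = [node for node in tr.xpath('node()')
        if not (isinstance(node, str) and node.strip() == "")]

# normalize-space() also collapses internal runs of whitespace.
normalized = tr.xpath('normalize-space(.)')

print(repr(text))
print(kept)
print(repr(normalized))
```

strip() only touches the two ends of the string, so normalize-space() is the better fit when whitespace inside the text should be collapsed as well.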