Parse an HTML body fragment in lxml

Question

I'm trying to parse a fragment of HTML:

<body><h1>title</h1><img src=""></body>

I use lxml.html.fromstring, and it is driving me insane because it keeps stripping the <body> tag from my fragments:

 > lxml.html.fromstring('<html><h1>a</h1></html>').tag
 'html'
 > lxml.html.fromstring('<div><h1>a</h1></div>').tag
 'div'
 > lxml.html.fromstring('<body><h1>a</h1></body>').tag
 'h1'

I've also tried document_fromstring, fragment_fromstring, clean_html with page_structure=False, and so on. Nothing works.

I need to use lxml, since I'm passing the HTML fragment to PyQuery.

I just want lxml to leave my HTML fragment alone. Is that possible?

Recommended answer

.fragment_fromstring() removes the <html> tag as well. Whenever the input is not a full HTML document (one with an <html> top-level element and/or a doctype), .fromstring() falls back to .fragment_fromstring(), and that method always removes both the <html> and the <body> tags.

One work-around is to tell .fragment_fromstring() to give you a <body> parent tag:

>>> lxml.html.fragment_fromstring('<body><h1>a</h1></body>', create_parent='body')
<Element body at 0x10d06fbf0>

This does not preserve any attributes on the original <body> tag.

Another work-around is to use the .document_fromstring() method, which wraps your markup in an <html> tag that you can then step past by taking its first child:

>>> lxml.html.document_fromstring('<body><h1>a</h1></body>')[0]
<Element body at 0x10d06fcb0>

This does preserve the attributes on <body>:

>>> lxml.html.document_fromstring('<body class="foo"><h1>a</h1></body>')[0].attrib
{'class': 'foo'}

Using the .document_fromstring() function on your first example gives:

>>> body = lxml.html.document_fromstring('<body><h1>title</h1><img src=""></body>')[0]
>>> lxml.html.tostring(body)
'<body><h1>title</h1><img src=""></body>'
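
The .document_fromstring() work-around can be wrapped in a small helper. This is only a sketch: parse_body is a hypothetical name, not part of lxml, and it uses find('body') instead of [0] on the assumption that markup containing head-level elements (a <title>, for example) may produce a <head> as the first child of <html>.

```python
import lxml.html


def parse_body(markup):
    """Parse markup and return its <body> element, attributes included.

    parse_body is a hypothetical helper name, not an lxml function.
    """
    # document_fromstring() always builds a full document with an <html> root.
    document = lxml.html.document_fromstring(markup)
    # find() searches only the direct children of <html>;
    # it returns None when there is no <body> child.
    return document.find('body')


body = parse_body('<body class="foo"><h1>title</h1><img src=""></body>')
print(body.tag, dict(body.attrib))
print([child.tag for child in body])
```

The returned element is an ordinary lxml element, so it should be usable anywhere the result of .fromstring() was used before.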

If you only want to do this when there is no <html> tag, do what lxml.html.fromstring() does and test for a full document first:

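import lxml.html

# inputtext is bytes or text; pick the matching private helper, as fromstring() does
is_bytes = isinstance(inputtext, bytes)
htmltest = lxml.html._looks_like_full_html_bytes if is_bytes else lxml.html._looks_like_full_html_unicode
if htmltest(inputtext):
    tree = lxml.html.fromstring(inputtext)  # already a full document
else:
    tree = lxml.html.document_fromstring(inputtext)[0]  # keep the <body> element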
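
The underscore-prefixed helpers above are private to lxml and could change between releases. A rough stand-in that uses only the standard library might look like the sketch below. Both the function name looks_like_full_html and the pattern are assumptions: the pattern is meant to approximate "optional whitespace followed by <html or <!doctype", and the check in your installed lxml version may differ.

```python
import re

# Assumed pattern: optional leading whitespace, then "<html" or "<!doctype",
# matched case-insensitively. This approximates lxml's check; it is not lxml's code.
_FULL_DOCUMENT = re.compile(r'^\s*<(?:html|!doctype)', re.IGNORECASE)


def looks_like_full_html(markup):
    """Return True if markup appears to start a complete HTML document.

    looks_like_full_html is a hypothetical helper, not part of lxml.
    """
    if isinstance(markup, bytes):
        # Only the ASCII prefix matters for this test, so undecodable
        # bytes are replaced rather than raising an error.
        markup = markup.decode('ascii', 'replace')
    return _FULL_DOCUMENT.match(markup) is not None


print(looks_like_full_html('<html><h1>a</h1></html>'))
print(looks_like_full_html('<body><h1>a</h1></body>'))
```

With a helper like this, the branch above could call looks_like_full_html(inputtext) instead of choosing between the two private functions.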
