How to replace HTML comments with custom <comment> elements

Question

I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.

A sample HTML file looks something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the comments are giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this example HTML, I was able to replace the first two HTML comments, but not the comments inside the html tag or the last comment after the closing html tag.

Here's my code:

import re

import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:

    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gr;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))

This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?

Could I serialize it to a byte stream with pickle, apply a regex, and then deserialize it back into a BeautifulSoup object? Would this work or just cause more problems?

I tried using pickle on the child tag object but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.

Then I tried pickling just the text of the tag, via child.text, but deserialization failed due to AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So, I have no idea how to modify the text.

Recommended answer

You have a couple of problems:


  1. You can't modify child.text. It's a read-only property that just calls get_text() behind the scenes, and its result is a brand new string unconnected to your document.

  2. re.sub() doesn't modify anything in place; it returns a new string. Your line

re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

would have to be

child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

... but that wouldn't work anyway, because of point 1.
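
Both points can be checked without Beautiful Soup. The sketch below uses only the standard library; the Node class is a hypothetical stand-in for a tag whose text is computed on demand, not part of bs4.

```python
import re

html = "<body><!-- note --></body>"

# re.sub() returns a new string; the input string is left untouched.
replaced = re.sub(r"<!--", "<comment>", html)
unchanged = html == "<body><!-- note --></body>"


class Node:
    """Hypothetical stand-in for a tag with a read-only, computed text."""

    def __init__(self, parts):
        self._parts = parts

    @property
    def text(self):
        # Builds a fresh string on every access, much like get_text().
        return "".join(self._parts)


node = Node(["a", "b"])
try:
    node.text = "changed"  # the property defines no setter
    assignment_failed = False
except AttributeError:
    assignment_failed = True

print(replaced)
print(unchanged, assignment_failed)
```

Discarding the return value of re.sub(), as the code in the question does, therefore has no effect, and assigning the result back to a read-only property raises AttributeError.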

Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.

Here's a solution that works:

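import bs4

with open("example.html") as f:
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = bs4.BeautifulSoup(f, "html.parser")

# text= is the older name of this argument; newer releases also accept string=.
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = soup.new_tag("comment")  # build the replacement <comment> element
    tag.string = comment.strip()   # comment text without surrounding whitespace
    comment.replace_with(tag)

print(soup)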

This code starts by iterating over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument. In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string. The lambda is just shorthand for, e.g.:

def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)
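
The subclass relationship is easy to confirm on a tiny document. This is a minimal sketch, assuming bs4 is installed and using the built-in "html.parser"; the one-line markup is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")

# The <p> tag has two children: a plain string and a comment.
children = soup.p.contents
kinds = [type(node).__name__ for node in children]

# Every Comment is also a NavigableString, so string filters see it.
is_subclass = issubclass(Comment, NavigableString)
all_strings = all(isinstance(node, NavigableString) for node in children)

print(kinds)
print(is_subclass, all_strings)
```

Because a comment is just a special kind of string node, a filter function only needs the isinstance() check to single it out.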

Then, we create a new Tag with the appropriate name, set its content to be the stripped content of the original comment, and replace the comment with the tag we just created.
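
The same steps can be wrapped in a function that works on a string instead of a file. This is only a sketch under a few assumptions: the name comments_to_elements is hypothetical, the parser is fixed to "html.parser", and the filter is passed as string=, the newer name of the text argument.

```python
from bs4 import BeautifulSoup, Comment


def comments_to_elements(markup: str) -> str:
    """Return markup with each HTML comment replaced by a <comment> element."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all() returns a list, so replacing nodes while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        tag = soup.new_tag("comment")
        tag.string = comment.strip()
        comment.replace_with(tag)
    return str(soup)


sample = "<html><body><p>x</p><!-- a < b --></body></html><!-- last -->"
converted = comments_to_elements(sample)
print(converted)
```

Since the comment text becomes ordinary element text, characters such as < are escaped when the tree is serialized, which a regex substitution on the raw markup would not do.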

Here's the result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
