How to replace HTML comments with custom <comment> elements

Views: 423 | Published: 2016/8/5 19:08:14 | Tags: python html regex xml beautifulsoup

Question

I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.

A sample HTML file looks something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the comments are giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this example HTML, I was able to replace the first two HTML comments, but not the comments inside the html tag or the last comment after the closing html tag.

Here is my code:

import re
import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:

    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gr;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))

This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?

Could I convert it to a byte stream by serializing it with pickle, apply a regex, and then deserialize it back to a BeautifulSoup object? Would this work or just cause more problems?

I tried using pickle on the child tag object but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.

Then I tried pickling just the text of the tag, via child.text, but deserialization failed due to AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So, I have no idea how to modify the text.

Recommended answer

You have a few problems:

1. You can't modify child.text. It's a read-only property that just calls get_text() behind the scenes, and its result is a brand new string unconnected to your document.

2. re.sub() doesn't modify anything in-place; it returns a new string. Your line

   re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

   would have to be

   child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

   ... but that wouldn't work anyway, because of point 1.

3. Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.
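
The first two problems can be reproduced without BeautifulSoup. The following is only a minimal sketch: Node is a hypothetical stand-in class (not part of bs4) whose text property has no setter, which mimics the behaviour described for child.text. It shows that re.sub() hands back a new string and that assigning to such a property fails.

```python
import re


class Node:
    """Hypothetical stand-in for a bs4 Tag; not part of BeautifulSoup."""

    def __init__(self, raw):
        self._raw = raw

    @property
    def text(self):
        # Computed on access; there is no setter, so the property is read-only.
        return self._raw


node = Node("<!-- a comment -->")

# re.sub() returns a new string; its input is left untouched.
replaced = re.sub(r"<!--", "<comment>", node.text)
print(replaced)   # the substituted copy
print(node.text)  # still the original text

# Assigning to a property without a setter raises AttributeError.
try:
    node.text = replaced
    assignment_failed = False
except AttributeError:
    assignment_failed = True
print(assignment_failed)
```

This matches the AttributeError reported in the question: the substituted string exists, but there is nowhere in the document to store it.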

Here's a solution that works:

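import bs4

with open("example.html") as f:
    # Name the parser explicitly; otherwise bs4 picks whichever is installed and warns.
    soup = bs4.BeautifulSoup(f, "html.parser")

# find_all() returns a list, so replacing nodes inside the loop is safe.
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")   # new <comment> element
    tag.string = comment.strip()    # comment text without surrounding whitespace
    comment.replace_with(tag)       # swap the Comment node for the new tag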

This code starts by iterating over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument. In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string, and the lambda ... is just a shorthand for e.g.

def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)
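
To see the predicate at work on its own, here is a small sketch on an inline snippet. The markup and variable names are made up for illustration, and it uses the string argument, which is the newer name for text (Beautiful Soup 4.4.0 and later).

```python
import bs4

# Inline markup invented for this example.
soup = bs4.BeautifulSoup("<p>text<!-- note --></p>", "html.parser")


def is_comment(node):
    # Comment nodes are NavigableString instances, so a string search reaches them.
    return isinstance(node, bs4.Comment)


comments = soup.find_all(string=is_comment)

print(issubclass(bs4.Comment, bs4.NavigableString))
print([c.strip() for c in comments])
```

The ordinary string "text" is also passed to is_comment, but it is not a Comment, so only the comment node is returned.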

Then, we create a new Tag with the appropriate name, set its content to be the stripped content of the original comment, and replace the comment with the tag we just created.
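
The whole find-and-replace step can be tried on an inline string instead of a file. This is a sketch rather than the answer's exact code: the sample markup is invented, the parser is assumed to be html.parser, and the new element is built with soup.new_tag(), the factory method documented by Beautiful Soup, instead of calling bs4.Tag directly.

```python
import bs4

# Invented sample with comments in the head, in the body, and after </html>.
html = """<html><head><!-- in head --></head>
<body><p>text</p><!-- in body --></body></html>
<!-- trailing -->"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find_all() returns a list, so the tree can be modified while looping over it.
for comment in soup.find_all(string=lambda s: isinstance(s, bs4.Comment)):
    tag = soup.new_tag("comment")   # empty <comment> element owned by this soup
    tag.string = comment.strip()    # comment text without surrounding whitespace
    comment.replace_with(tag)       # swap the Comment node for the new element

print(soup)
```

Because the search walks every descendant of the document, nested comments and the one after the closing html tag are handled by the same loop.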

Here is the result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
