How to replace HTML comments with custom <comment> elements
Question
I'm working on mass-converting a number of HTML files to XML using BeautifulSoup in Python.
A sample HTML file looks something like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<!-- here is a comment inside the head tag -->
</head>
<body>
...
<!-- Comment inside body tag -->
<!-- Another comment inside body tag -->
<!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
</body>
</html>
<!-- This comment is the last line of the file -->
I figured out how to find the doctype and replace it with the tag <doctype>...</doctype>, but the comments are giving me a lot of frustration. I want to replace the HTML comments with <comment>...</comment>. In this example HTML, I was able to replace the first two HTML comments, but not anything inside the html tag or the last comment after the closing html tag.
Here is my code:

import re
import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:
    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(<!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gr;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print(soup.prettify(formatter=None))
This is my first time using BeautifulSoup. How do I use BeautifulSoup to find and replace all HTML comments with the <comment> tag?
Could I convert it to a byte stream via pickle, serialize it, apply a regex, and then deserialize it back to a BeautifulSoup object? Would this work, or just cause more problems?
I tried using pickle on the child tag object, but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.
Then I tried pickling just the text of the tag, via child.text, but deserialization failed due to AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So I have no idea how to modify the text.
Recommended answer
You have a couple of problems:

1. You can't modify child.text. It's a read-only property that just calls get_text() behind the scenes, and its result is a brand new string unconnected to your document.

2. re.sub() doesn't modify anything in place; it returns a new string. Your line

   re.sub("(<!--)|(<!--)", "<comment>", child.text, flags=re.MULTILINE)

   would have to be

   child.text = re.sub("(<!--)|(<!--)", "<comment>", child.text, flags=re.MULTILINE)

   ... but that wouldn't work anyway, because of point 1.
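Both points can be checked with a minimal sketch. The inline HTML snippet and the variable names are made up for illustration, and the built-in "html.parser" is assumed so that no extra parser is needed:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-line document, parsed with the built-in parser.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

# .text builds a fresh str on every access; re.sub() returns another new str.
text = p.text
replaced = re.sub("world", "there", text)

print(replaced)  # the new string returned by re.sub()
print(p.text)    # the document itself is unchanged

# Assigning to .text is rejected because the property has no setter.
try:
    p.text = replaced
    assignment_failed = False
except AttributeError:
    assignment_failed = True

print(assignment_failed)
```

Neither the regex result nor the attempted assignment touches the parse tree, which is why the original loop appeared to run but changed nothing.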
Trying to modify the document by replacing chunks of text in it with a regex is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.
Here's a solution that works:

import bs4

with open("example.html") as f:
    soup = bs4.BeautifulSoup(f)

for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")
    tag.string = comment.strip()
    comment.replace_with(tag)
This code starts by iterating over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument. In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string. The lambda ... is just shorthand for, e.g.:
def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)
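The subclass relationship that makes this search work can be verified directly. This is a small sketch on a made-up snippet, again assuming the built-in "html.parser"; it uses the string keyword, which Beautiful Soup 4.4.0 and later accept as the newer name for text:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical snippet containing one ordinary string and one comment.
soup = BeautifulSoup("<p>plain text<!-- a comment --></p>", "html.parser")

# Comment inherits from NavigableString, so string searches can see it.
is_subclass = issubclass(Comment, NavigableString)

# string=True matches every string node, comments included.
all_strings = soup.find_all(string=True)

# Filtering on the type keeps only the comments.
comments = [s for s in all_strings if isinstance(s, Comment)]

print(is_subclass)
print(len(all_strings), len(comments))
```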
Then we create a new Tag with the appropriate name, set its content to the stripped content of the original comment, and replace the comment with the tag we just created.
Here is the result:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<comment>here is a comment inside the head tag</comment>
</head>
<body>
...
<comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
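The same node-replacement idea can also be sketched in a self-contained form. This variant is not the code from the answer: it assumes the built-in "html.parser", builds the replacement with soup.new_tag() instead of calling bs4.Tag directly, and uses the string keyword (the name used for the text argument in Beautiful Soup 4.4.0 and later). The inline HTML is a made-up stand-in for the sample file.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the sample file.
html = (
    "<html><head><!-- in head --></head>"
    "<body><p>text</p><!-- in body --></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so replacing nodes while looping over it is safe.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    tag = soup.new_tag("comment")   # new <comment> element tied to this soup
    tag.string = comment.strip()    # comment text without surrounding spaces
    comment.replace_with(tag)       # swap the Comment node for the new Tag

result = str(soup)
print(result)
```

Creating the tag through soup.new_tag() ties it to the same soup object and its parser settings, which may matter when the document is serialized later.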