Beautiful Soup replaces < with &lt;
Problem description
I've found the text I want to replace, but when I print soup the format gets changed: <div id="content">stuff here</div> becomes &lt;div id="content"&gt;stuff here&lt;/div&gt;. How can I preserve the data? I have tried print(soup.encode(formatter="none")), but that produces the same incorrect format.
from bs4 import BeautifulSoup

with open(index_file) as fp:
    soup = BeautifulSoup(fp, "html.parser")

found = soup.find("div", {"id": "content"})
found.replace_with(data)
When I print found, I get the correct format:
>>> print(found)
<div id="content">stuff</div>
The contents of index_file are as follows:
<!DOCTYPE html>
<head>
Apples
</head>
<body>
<div id="page">
This is the Id of the page
<div id="main">
<div id="content">
stuff here
</div>
</div>
footer should go here
</div>
</body>
</html>
Recommended answer
The found object is not a Python string; it is a Tag that just happens to have a nice string representation. You can verify this with:
type(found)
A Tag is part of the hierarchy of objects that Beautiful Soup creates so that you can interact with the HTML. Another such object is NavigableString. A NavigableString is a lot like a string, but it can only hold what goes into the text content of the HTML, not markup.
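To make the distinction between the two object types concrete, here is a minimal sketch. It assumes a one-line document rather than the asker's index_file, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# A tiny stand-in document (an assumption, not the asker's full file).
soup = BeautifulSoup('<div id="content">stuff here</div>', "html.parser")
found = soup.find("div", {"id": "content"})

# found is a Tag, even though printing it looks like a plain string.
found_is_tag = isinstance(found, Tag)
found_is_str = isinstance(found, str)

# The text inside the tag is a NavigableString, which subclasses str.
text = found.string
text_is_navigable = isinstance(text, NavigableString)

print(type(found).__name__, type(text).__name__)
```

The Tag carries the markup structure, while the NavigableString carries only the text between the opening and closing tags.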
When you do
found.replace_with('<div id="content">stuff here</div>')
you are asking for the Tag to be replaced with a NavigableString containing that literal text. For the HTML to display that string as text, every angle bracket in it has to be escaped, which is exactly what Beautiful Soup is doing.
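The escaping can be reproduced in a few lines. This is a sketch on a shortened stand-in document (an assumption, not the asker's file); the behavior shown is what happens when replace_with receives a plain str.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the asker's page (an assumption).
soup = BeautifulSoup(
    '<div id="page"><div id="content">old</div></div>', "html.parser"
)
found = soup.find("div", {"id": "content"})

# A plain str becomes a text node, not markup, so it is escaped on output.
found.replace_with('<div id="content">stuff here</div>')

escaped = str(soup)
print(escaped)

# The tree no longer contains a div with id="content", only text.
content_div = soup.find("div", {"id": "content"})
```

After the replacement, the content div no longer exists as a tag in the tree; what remains is text that merely looks like one.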
Instead of that mess, you probably want to keep your Tag and replace only its content:
found.string.replace_with('stuff here')
Notice that the correct replacement does not attempt to overwrite the tags.
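A short sketch of the text-only replacement follows, on a stand-in document (an assumption). The second half goes slightly beyond the answer above: it assumes the replacement really is markup, and parses it into a Tag first so that nothing needs escaping. The names new_markup and new_tag are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the asker's page (an assumption).
soup = BeautifulSoup(
    '<div id="main"><div id="content">old text</div></div>', "html.parser"
)
found = soup.find("div", {"id": "content"})

# Option 1 (from the answer): keep the Tag and swap only its text.
found.string.replace_with("stuff here")
after_text_swap = str(soup)

# Option 2 (an assumption about the goal): parse the new markup into a Tag,
# then replace the old Tag with the new one.
new_markup = '<div id="content">other stuff</div>'
new_tag = BeautifulSoup(new_markup, "html.parser").div
found.replace_with(new_tag)
after_tag_swap = str(soup)

print(after_text_swap)
print(after_tag_swap)
```

In both cases the angle brackets in the output belong to real tags, so they are not escaped.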
When you do found.replace_with(...), the object referred to by the name found gets replaced in the parent hierarchy. However, the name found keeps pointing to the same outdated object as before. That is why printing soup shows the update, but printing found does not.
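The stale reference is easy to observe directly. The following sketch uses a shortened stand-in document (an assumption) and shows that the tree changes while found stays as it was.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the asker's page (an assumption).
soup = BeautifulSoup(
    '<div id="main"><div id="content">old</div></div>', "html.parser"
)
found = soup.find("div", {"id": "content"})

found.replace_with("new text")

# The tree now holds the replacement text.
tree_after = str(soup)

# found still refers to the old Tag, which has been detached from the tree.
stale = str(found)
detached = found.parent is None

print(tree_after)
print(stale)
```

To inspect the result of a replacement, look it up again through soup rather than reusing the old name.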