Beautiful Soup replaces < with &lt;


Problem description

I've found the text I want to replace, but when I print soup the markup comes out escaped: <div id="content">stuff here</div> becomes &lt;div id="content"&gt;stuff here&lt;/div&gt;. How can I preserve the data? I have tried print(soup.encode(formatter="none")), but that produces the same incorrect output.

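from bs4 import BeautifulSoup

# index_file: path to the HTML file shown further down
with open(index_file) as fp:
    soup = BeautifulSoup(fp, "html.parser")

# data: the replacement markup, passed as a plain string
found = soup.find("div", {"id": "content"})
found.replace_with(data)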

When I print found, I get the correct format:

>>> print(found)
<div id="content">stuff</div>

The contents of index_file are as follows:

 <!DOCTYPE html>
 <head>
    Apples 
 </head>
 <body>

   <div id="page">
    This is the Id of the page

  <div id="main">

     <div id="content">
       stuff here
     </div>
  </div>
 footer should go here
 </div>
</body>
</html>

Recommended answer

The found object is not a Python string; it is a Tag that just happens to have a nice string representation. You can verify this by running

type(found)

A Tag is part of the hierarchy of objects that Beautiful Soup creates so that you can interact with the HTML. Another such object is NavigableString. A NavigableString is a lot like a string, but it can only contain things that would go into the content portion of the HTML.
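As a minimal sketch of the distinction, the snippet below parses a one-line document (an assumption standing in for index_file) and checks the types involved. The variable names found_is_tag, text_is_navigable and text_is_str are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# A tiny stand-in document; the real index_file is larger.
soup = BeautifulSoup('<div id="content">stuff here</div>', "html.parser")
found = soup.find("div", {"id": "content"})

# find() returns a Tag, not a str, even though print(found) looks like text.
found_is_tag = isinstance(found, Tag)

# The text inside the tag is a NavigableString, which subclasses str.
text_is_navigable = isinstance(found.string, NavigableString)
text_is_str = isinstance(found.string, str)

print(type(found), type(found.string))
```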

When you do

found.replace_with('<div id="content">stuff here</div>')

you are asking for the Tag to be replaced with a NavigableString containing that literal text. The only way for HTML to display that string is to escape all the angle brackets, which is exactly what happens.
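The escaping can be reproduced in a few lines. This is a sketch on a shortened document (an assumption, not the original index_file), using the default output formatter.

```python
from bs4 import BeautifulSoup

html = '<div id="main"><div id="content">old</div></div>'
soup = BeautifulSoup(html, "html.parser")
found = soup.find("div", {"id": "content"})

# A plain str is inserted as text, so it becomes a NavigableString.
found.replace_with('<div id="content">stuff here</div>')

# On output, the angle brackets of that text are escaped as entities.
escaped_output = str(soup)
print(escaped_output)
```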

Instead, you probably want to keep your Tag and replace only its content:

found.string.replace_with('stuff here')

Notice that the correct replacement does not attempt to overwrite the tags.
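The sketch below shows that approach on a shortened document (an assumption). It also shows a second option that is not part of the answer above: if the replacement really is markup, it can be parsed into a Tag first and then passed to replace_with. The names text_result, markup_result and new_tag are illustrative.

```python
from bs4 import BeautifulSoup

html = '<div id="main"><div id="content">old text</div></div>'

# Option 1: keep the Tag and replace only the text inside it.
soup = BeautifulSoup(html, "html.parser")
found = soup.find("div", {"id": "content"})
found.string.replace_with("stuff here")
text_result = str(soup)

# Option 2 (an addition, not from the answer): parse the new markup
# so that a Tag, rather than a plain string, is inserted into the tree.
soup2 = BeautifulSoup(html, "html.parser")
new_tag = BeautifulSoup('<div id="content">stuff here</div>', "html.parser").div
soup2.find("div", {"id": "content"}).replace_with(new_tag)
markup_result = str(soup2)

print(text_result)
print(markup_result)
```

Note that found.string is None when the tag has more than one child, so option 1 only applies when the tag contains a single piece of text, as it does here.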

When you do found.replace_with(...), the object referred to by the name found gets replaced in the parent hierarchy. However, the name found keeps pointing to the same outdated object as before. That is why printing soup shows the update, but printing found does not.
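A short sketch of this behaviour, again on a shortened document (an assumption): after the replacement, found is detached from the tree and still holds the old markup, while soup reflects the change. The names detached, old_markup and tree_markup are illustrative.

```python
from bs4 import BeautifulSoup

html = '<div id="main"><div id="content">old</div></div>'
soup = BeautifulSoup(html, "html.parser")
found = soup.find("div", {"id": "content"})

found.replace_with("new text")

# The old Tag has been removed from the tree, so it has no parent.
detached = found.parent is None

# The name found still refers to the old, unchanged Tag.
old_markup = str(found)

# The tree itself now contains the replacement text.
tree_markup = str(soup)

print(old_markup)
print(tree_markup)
```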
