Replace text without escaping in BeautifulSoup

Problem description

I would like to wrap certain words in anchor links in BeautifulSoup, skipping any words that are already inside a link. This is the code I use:

from bs4 import BeautifulSoup
import re

text = ''' replace this string '''

soup = BeautifulSoup(text)
pattern = 'replace'

for txt in soup.findAll(text=True):
    if re.search(pattern,txt,re.I) and txt.parent.name != 'a':
        newtext = re.sub(r'(%s)' % pattern,
                         r'<a href="#\1">\1</a>',
                         txt)
        txt.replaceWith(newtext)
print(soup)

Unfortunately, this returns:

<html><body><p>&lt;a href="#replace"&gt;replace&lt;/a&gt; this string </p></body></html>

whereas what I am looking for is:

<html><body><p><a href="#replace">replace</a> this string </p></body></html>

Is there a way to tell BeautifulSoup not to escape the link elements?

A simple regex replacement will not do here, because I will eventually have several patterns to replace rather than just one. That is why I decided to use BeautifulSoup to exclude everything that is already a link.
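As a rough illustration of the multiple-pattern case, the patterns could be combined into a single case-insensitive regex. The pattern list below is hypothetical and the variable names are assumptions; this is a sketch, not part of the original code.

```python
import re

# Hypothetical pattern list, for illustration only.
patterns = ["replace", "string"]

# Escape each pattern so regex metacharacters are matched literally,
# and try longer patterns first so they win over shorter prefixes.
ordered = sorted(patterns, key=len, reverse=True)
combined = re.compile("|".join(re.escape(p) for p in ordered), re.I)

# List every match in a sample sentence.
print(combined.findall("Replace this string"))
```

A single compiled regex keeps the per-text-node work to one scan, whichever pattern happens to match.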

Recommended answer

You need to create a new tag with new_tag, then use insert_after to insert the rest of your text after the newly created a tag.

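for txt in soup.find_all(text=True):
    if re.search(pattern, txt, re.I) and txt.parent.name != 'a':
        # Build a real <a> Tag; a markup string would be escaped as text.
        newtag = soup.new_tag('a')
        newtag.attrs['href'] = "#{}".format(pattern)
        newtag.string = pattern
        # Swap the text node for the tag, then re-add the remaining text.
        # Note: this assumes the match sits at the start of the text node.
        txt.replace_with(newtag)
        newtag.insert_after(txt.replace(pattern, ""))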
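The markup was escaped in the question because replace_with treats a plain str as text, not as HTML. The loop above avoids that, but only handles a match at the start of a text node. A more general variant might split each text node around every match, as sketched below. The function name link_matches, the sample HTML, and the choice of html.parser are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup, Comment, NavigableString

def link_matches(soup, regex):
    """Wrap every regex match found outside <a> tags in an anchor tag."""
    for txt in soup.find_all(string=True):
        # Skip comments and text already inside a link (at any depth).
        if isinstance(txt, Comment) or txt.find_parent("a") is not None:
            continue
        text = str(txt)
        pieces, pos = [], 0
        for m in regex.finditer(text):
            if m.start() > pos:
                # Plain text between the previous match and this one.
                pieces.append(NavigableString(text[pos:m.start()]))
            tag = soup.new_tag("a", href="#" + m.group(0))
            tag.string = m.group(0)
            pieces.append(tag)
            pos = m.end()
        if not pieces:
            continue  # no match in this text node
        if pos < len(text):
            pieces.append(NavigableString(text[pos:]))
        # Swap the original node for the first piece, then chain the rest.
        node = pieces[0]
        txt.replace_with(node)
        for piece in pieces[1:]:
            node.insert_after(piece)
            node = piece

html = '<p>Replace this string, but not <a href="#x">replace</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
link_matches(soup, re.compile("replace|string", re.I))
print(soup)
```

Because every inserted piece is either a Tag or a NavigableString, nothing is escaped, and the existing link is left untouched while the other matches become anchors.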
