How to find/replace text in html while preserving html tags/structure
Problem description
I use regexps to transform text as I want, but I want to preserve the HTML tags.
For example, if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack <sometag>overflow</sometag>, I must obtain stack <sometag>underflow</sometag> (i.e. the string substitution is done, but the tags are still there...)
When manipulating HTML, use a DOM library rather than regular expressions:
- lxml: a parser, document, and HTML serializer. It can also use BeautifulSoup and html5lib for parsing.
- BeautifulSoup: a parser, document, and HTML serializer.
- html5lib: a parser. It has a serializer.
- ElementTree: a document object, and XML serializer.
- cElementTree: a document object implemented as a C extension.
- HTMLParser: a parser.
- Genshi: includes a parser, document, and HTML serializer.
- xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.
Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.
Out of these I would recommend lxml, html5lib, and BeautifulSoup.
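As a rough illustration of the idea (touch only the text, leave the markup alone), here is a minimal sketch that uses the standard-library HTMLParser from the list above, so it runs without installing anything. The names TextReplacer and replace_text are made up for this example. It applies the regex to each text node separately, so it handles the "overflow" inside <sometag> but would not match a phrase that is split across a tag boundary. It also lowercases closing tag names and does not treat script or style contents specially. With lxml or BeautifulSoup the same approach would mean walking the text nodes of the parsed tree.

```python
import re
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Re-emit the markup as parsed, applying a regex to text nodes only."""

    def __init__(self, pattern, replacement):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.pattern = re.compile(pattern)
        self.replacement = replacement
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/>: emit once, no separate end tag
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # The parser reports the tag name lowercased
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # The only place where the substitution is applied
        self.out.append(self.pattern.sub(self.replacement, data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_text(html, pattern, replacement):
    """Return html with pattern replaced in text, tags left in place."""
    parser = TextReplacer(pattern, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(replace_text("stack <sometag>overflow</sometag>", r"overflow", "underflow"))
# stack <sometag>underflow</sometag>
```

Because the substitution only ever sees text, an attribute value such as href="overflow.html" is left untouched, which is the main thing a plain regex over the whole string gets wrong.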