Is it possible to escape regex metacharacters reliably with sed
Question
I'm wondering whether it is possible to write a 100% reliable sed command to escape any regex metacharacters in an input string so that it can be used in a subsequent sed command. Like this:
#!/bin/bash
# Trying to replace one regex by another in an input file with sed
search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"
# Sanitize input
search=$(sed 'script to escape' <<< "$search")
replace=$(sed 'script to escape' <<< "$replace")
# Use it in a sed command
sed "s/$search/$replace/" input
I know that there are better tools for working with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible or not with sed. I would say let's concentrate on basic POSIX regexes to have even more fun! :)
I have tried a lot of things, but every time I could find an input that broke my attempt. I thought keeping it abstract as 'script to escape' would not lead anybody in the wrong direction.
Btw, the discussion came up here. I thought this could be a good place to collect solutions and probably break and/or elaborate them.
Recommended answer
Note:
If you're looking for pre-packaged functionality based on the techniques discussed in this answer:
- bash functions that enable robust escaping even in multi-line substitutions can be found at the bottom of this post (plus a perl solution that uses perl's built-in support for such escaping).
- @EdMorton's answer contains a tool (bash script) that robustly performs single-line substitutions.
- Ed's answer now has an improved version of the sed command used below, which is needed if you want to escape string literals for potential use with other regex-processing tools, such as awk and perl. In short: for cross-tool use, \ must be escaped as \\ rather than as [\], which means: instead of the
sed 's/[^^]/[&]/g; s/\^/\\^/g'
command used below, you must use
sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
All snippets assume bash as the shell (POSIX-compliant reformulations are possible):
To give credit where credit is due: I found the regex used below in this answer.
Assuming that the search string is a single-line string:
search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3' # sample input containing metachars.
searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.
sed -n "s/$searchEscaped/foo/p" <<<"$search" # if ok, echoes 'foo'
- Every character except ^ is placed in its own character set [...] expression to treat it as a literal.
- Note that ^ is the one char. you cannot represent as [^], because it has special meaning in that location (negation).
- Then, ^ chars. are escaped as \^.
- Note that you cannot just escape every char by putting a \ in front of it, because that can turn a literal char into a metachar; e.g., \< and \b are word boundaries in some tools, \n is a newline, \{ is the start of a RE interval like \{1,3\}, etc.
The approach is robust, but not efficient.
The robustness comes from not trying to anticipate all special regex characters, which vary across regex dialects, but focusing on only 2 features shared by all regex dialects:
- the ability to specify literal characters inside a character set.
- the ability to escape a literal ^ as \^.
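As a rough illustration of the two rules above, the following sketch wraps the single-line escaping command in a helper and round-trips a few literals that contain typical metacharacters. The function name escapeRe and the sample literals are assumptions made for this example; they are not part of the original answer, and the sketch assumes bash and a sed that treats \ as a literal inside [...].

```shell
#!/usr/bin/env bash
# Hypothetical helper (name chosen for this sketch): escape a single-line
# literal for use as a sed regex, using the two rules described above.
escapeRe() {
  sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1"
}

# Round-trip check: each literal, once escaped, should match itself,
# in which case sed prints MATCHED in the last column.
for literal in 'a.b*c' '^start' 'x[y]z' 'end$' '1\{2\}'; do
  escaped=$(escapeRe "$literal")
  result=$(sed -n "s/$escaped/MATCHED/p" <<<"$literal")
  printf '%-10s -> %-40s %s\n' "$literal" "$escaped" "$result"
done
```

The escaped form is several times longer than the input, which is the inefficiency mentioned above; the benefit is that no list of dialect-specific metacharacters has to be maintained.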
The replacement string in a sed s/// command is not a regex, but it recognizes placeholders that refer to either the entire string matched by the regex (&) or specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.
Assuming that the replacement string is a single-line string:
replace='Laurel & Hardy; PS\2' # sample input containing metachars.
replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar" # if ok, outputs $replace as is
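Putting the two single-line escapes together, a minimal sketch of a literal search-and-replace filter might look like the following. The function name replaceLiteral and the sample strings are hypothetical (not from the original answer); the sketch assumes bash, / as the s/// delimiter, and single-line search and replacement strings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: on each line of stdin, replace the first occurrence
# of the literal string $1 with the literal string $2.
replaceLiteral() {
  local searchEscaped replaceEscaped
  # Regex side: wrap every char except ^ in [...], then escape ^ as \^.
  searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$1")
  # Replacement side: backslash-escape &, / and \.
  replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$2")
  sed "s/$searchEscaped/$replaceEscaped/"
}

# Both arguments contain characters that would otherwise be special.
printf '%s\n' 'price: $5.00 (approx.)' | replaceLiteral '$5.00' 'A&B/C\1'
```

If the escaping works as intended, the replacement text appears in the output exactly as given, with & and \1 left uninterpreted.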
MULTI-line Solutions
Escaping a MULTI-LINE string literal for use as a regex in sed:
Note: This only makes sense if multiple input lines (possibly ALL) have been read before attempting to match.
Since tools such as sed and awk operate on a single line at a time by default, extra steps are needed to make them read more than one line at a time.
# Define sample multi-line literal.
search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
/def\n\t[A-Z]\+\([^ ]\)\{3,4\}\4'
# Escape it.
searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n')  #'
# Use in a Sed command that reads ALL input lines up front.
# If ok, echoes 'foo'
sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search"
- The newlines in multi-line input strings must be translated to '\n' strings, which is how newlines are encoded in a regex.
- $!a\'$'\n''\\n' appends string '\n' to every output line but the last (the last newline is ignored, because it was added by <<<).
- tr -d '\n' then removes all actual newlines from the string (sed adds one whenever it prints its pattern space), effectively replacing all newlines in the input with '\n' strings.
- -e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop, therefore leaving subsequent commands to operate on all input lines at once.
- If you're using GNU sed (only), you can use its -z option to simplify reading all input lines at once:
sed -z "s/$searchEscaped/foo/" <<<"$search"
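To make the note about reading multiple lines concrete, here is a small sketch that applies the same escaped multi-line pattern twice: once line by line (where it cannot match) and once after slurping all input with the loop idiom. The sample literal and the variable names lineByLine and slurped are assumptions for this example, and the a\ one-liner form is assumed to behave as in the answer's own snippet.

```shell
#!/usr/bin/env bash
# Sample two-line literal containing regex metachars (illustrative only).
search=$'line1.\nline2*'

# Escape it: per-char [...] wrapping, ^ -> \^, real newlines -> '\n' strings.
searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n')

# Default line-by-line processing: the pattern space never contains a
# newline, so the multi-line regex has nothing to match against.
lineByLine=$(sed -n "s/$searchEscaped/foo/p" <<<"$search")

# Slurp all lines first; the regex can then span the embedded newline.
slurped=$(sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search")

printf 'line-by-line: [%s]\nslurped:      [%s]\n' "$lineByLine" "$slurped"
```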
Escaping a MULTI-LINE string literal for use as the replacement string in sed's s/// command:
# Define sample multi-line literal.
replace='Laurel & Hardy; PS\2
Masters\1 & Johnson\2'
# Escape it for use as a Sed replacement string.
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
replaceEscaped=${REPLY%$'\n'}
# If ok, outputs $replace as is.
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar"
- Newlines in the input string must be retained as actual newlines, but \-escaped.
- -e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop.
- 's/[&/\]/\\&/g escapes all &, \ and / instances, as in the single-line solution.
- s/\n/\\&/g' then \-prefixes all actual newlines.
- IFS= read -d '' -r is used to read the sed command's output as is (to avoid the automatic removal of trailing newlines that a command substitution ($(...)) would perform).
- ${REPLY%$'\n'} then removes a single trailing newline, which the <<< has implicitly appended to the input.
bash functions based on the above:
- quoteRe() quotes (escapes) for use in a regex
- quoteSubst() quotes for use in the substitution string of a s/// call.
- both handle multi-line input correctly
- Note that because sed reads a single line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
- Also, using command substitutions ($(...)) to call the functions won't work for strings that have trailing newlines; in that event, use something like IFS= read -d '' -r escapedValue < <(quoteSubst "$value")
# SYNOPSIS
#   quoteRe <text>
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
# SYNOPSIS
#   quoteSubst <text>
quoteSubst() {
  IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
  printf %s "${REPLY%$'\n'}"
}
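The caveat about trailing newlines can be seen in a short sketch. It repeats the quoteSubst() definition so that it runs on its own; the sample value and the variable names viaCmdSubst and viaRead are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Same definition as above, repeated so the sketch is self-contained.
quoteSubst() {
  IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
  printf %s "${REPLY%$'\n'}"
}

# Illustrative replacement text that ends in two newlines.
value=$'two trailing newlines\n\n'

# Command substitution strips the trailing newlines, leaving a dangling
# backslash at the end of the escaped text.
viaCmdSubst=$(quoteSubst "$value")

# Reading via process substitution keeps the \-escaped newlines intact.
# (read reports a non-zero status at end of input; the variable is still set.)
IFS= read -d '' -r viaRead < <(quoteSubst "$value")

# Only the second form is safe to embed in an s/// command.
sed "s/X/$viaRead/" <<<"aXb"
```

With the read-based form, the replacement should appear in the output with both of its trailing newlines preserved.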
Example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You & I'$'\n''eating A\1 sauce.' # sample replacement string with metachars.
# Should print the unmodified value of $to
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quoteRe "$from")/$(quoteSubst "$to")/" <<<"$from"
Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.
Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or its equivalent \Q...\E quoting.
The approach is the same for both single- and multi-line strings; for example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You owe me $1/$& for'$'\n''eating A\1 sauce.' # sample replacement string w/ metachars.
# Should print the unmodified value of $to.
# Note that the replacement value needs NO escaping.
perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"
- Note the use of -0777 to read all input at once, so that the multi-line substitution works.
- The -s option allows placing -<var>=<val>-style Perl variable definitions following -- after the script, before any filename operands.