Is it possible to escape regex metacharacters reliably with sed


Question

I'm wondering whether it is possible to write a 100% reliable sed command to escape any regex metacharacters in an input string so that it can be used in a subsequent sed command. Like this:

#!/bin/bash
# Trying to replace one regex by another in an input file with sed

search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"

# Sanitize input
search=$(sed 'script to escape' <<< "$search")
replace=$(sed 'script to escape' <<< "$replace")

# Use it in a sed command
sed "s/$search/$replace/" input

I know that there are better tools to work with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible or not with sed. I would say let's concentrate on basic POSIX regexes to have even more fun! :)

I have tried a lot of things, but every time I could find an input that broke my attempt. I thought keeping it abstract as script to escape would not lead anybody in the wrong direction.

Btw, the discussion came up here. I thought this could be a good place to collect solutions and probably break and/or elaborate them.

Recommended answer

Note:

  • If you're looking for prepackaged functionality based on the techniques discussed in this answer:
    • bash functions that enable robust escaping even in multi-line substitutions can be found at the bottom of this post (plus a perl solution that uses perl's built-in support for such escaping).
    • @EdMorton's answer contains a tool (bash script) that robustly performs single-line substitutions.
      • Ed's answer now has an improved version of the sed command used below, which is needed if you want to escape string literals for potential use with other regex-processing tools, such as awk and perl. In short: for cross-tool use, \ must be escaped as \\ rather than as [\], which means: instead of the
        sed 's/[^^]/[&]/g; s/\^/\\^/g' command used below, you must use
        sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'

All snippets assume bash as the shell (POSIX-compliant reformulations are possible):

To give credit where credit is due: I found the regex used below in this answer.

SINGLE-line Solutions

Escaping a string literal for use as a regex in sed:

Assuming that the search string is a single-line string:

search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3'  # sample input containing metachars.

searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.

sed -n "s/$searchEscaped/foo/p" <<<"$search" # if ok, echoes 'foo'
      

• Every character except ^ is placed in its own character set [...] expression to treat it as a literal.
  • Note that ^ is the one char. you cannot represent as [^], because it has special meaning in that location (negation).
• Then, ^ chars. are escaped as \^.
  • Note that you cannot just escape every char by putting a \ in front of it because that can turn a literal char into a metachar, e.g. \< and \b are word boundaries in some tools, \n is a newline, \{ is the start of a RE interval like \{1,3\}, etc.
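
As a minimal sketch of the two steps described above (the helper name escape_re is hypothetical and not part of the original answer; the sed command itself is the one shown earlier), the following prints the escaped form of a short sample and checks that it matches only the literal text:

```bash
# escape_re is a hypothetical helper name; the sed script is the one used above.
escape_re() {
  printf '%s\n' "$1" | sed 's/[^^]/[&]/g; s/\^/\\^/g'
}

literal='a.b^c'
escaped=$(escape_re "$literal")
printf '%s\n' "$escaped"                              # [a][.][b]\^[c]

# Unescaped, '.' would match any character; escaped, only a literal '.' matches.
printf '%s\n' 'a.b^c' | sed -n "s/$escaped/HIT/p"     # prints HIT
printf '%s\n' 'axb^c' | sed -n "s/$escaped/HIT/p"     # prints nothing
```

Every character ends up in its own one-character set, which is why the result is verbose but does not depend on knowing which characters a given regex dialect treats as special.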

The approach is robust, but not efficient.

The robustness comes from not trying to anticipate all special regex characters, which vary across regex dialects, but from relying on only 2 features shared by all regex dialects:

• the ability to specify literal characters inside a character set.
• the ability to escape a literal ^ as \^

Escaping a string literal for use as the replacement string in sed's s/// command:

The replacement string in a sed s/// command is not a regex, but it recognizes placeholders that refer to either the entire string matched by the regex (&) or specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.

Assuming that the replacement string is a single-line string:

replace='Laurel & Hardy; PS\2' # sample input containing metachars.

replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it

sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar" # if ok, outputs $replace as is
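
To make the effect of the replacement-string escaping visible, here is a small sketch that contrasts an unescaped and an escaped replacement. The helper name escape_subst and the sample strings are assumptions for illustration only; the sed command is the one used above.

```bash
# escape_subst is a hypothetical helper name; the sed script is the one used above.
escape_subst() {
  printf '%s\n' "$1" | sed 's/[&/\]/\\&/g'
}

# Unescaped, '&' expands to the matched text.
printf '%s\n' 'foo' | sed 's/foo/A & B/'                            # A foo B

# Escaped, the replacement text is inserted verbatim, including '/' and '\1'.
replacement='A & B/C \1'
printf '%s\n' "$(escape_subst "$replacement")"                      # A \& B\/C \\1
printf '%s\n' 'foo' | sed "s/foo/$(escape_subst "$replacement")/"   # A & B/C \1
```

Only three characters need treatment here (&, / and \), so a plain backslash prefix is enough and the character-set trick used for the regex side is not needed.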
              



MULTI-line Solutions

Escaping a MULTI-LINE string literal for use as a regex in sed:

Note: This only makes sense if multiple input lines (possibly ALL) have been read before attempting to match.
Since tools such as sed and awk operate on a single line at a time by default, extra steps are needed to make them read more than one line at a time.

# Define sample multi-line literal.
search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
/def\n\t[A-Z]\+\([^ ]\)\{3,4\}\4'

# Escape it.
searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n')

# Use in a Sed command that reads ALL input lines up front.
# If ok, echoes 'foo'
sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search"
              

• The newlines in multi-line input strings must be translated to '\n' strings, which is how newlines are encoded in a regex.
• $!a\'$'\n''\\n' appends string '\n' to every output line but the last (the last newline is ignored, because it was added by <<<)
• tr -d '\n' then removes all actual newlines from the string (sed adds one whenever it prints its pattern space), effectively replacing all newlines in the input with '\n' strings.
• -e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop, therefore leaving subsequent commands to operate on all input lines at once.
  • If you're using GNU sed (only), you can use its -z option to simplify reading all input lines at once:
    sed -z "s/$searchEscaped/foo/" <<<"$search"
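
The intermediate value is easier to follow on a short input. The sketch below wraps the pipeline from above in a helper with the hypothetical name escape_re_multi and applies it to a two-line sample chosen for illustration:

```bash
# escape_re_multi is a hypothetical helper name; the pipeline is the one used above.
escape_re_multi() {
  sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'
}

escaped=$(escape_re_multi $'a.b\nc^d')
printf '%s\n' "$escaped"      # [a][.][b]\n[c]\^[d]

# The escaped value only matches once all input lines are in the pattern space.
printf 'a.b\nc^d\n' | sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$escaped/HIT/p"   # prints HIT
printf 'axb\nc^d\n' | sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$escaped/HIT/p"   # prints nothing
```

The escaped result is a single line: the actual newline of the input has become the two-character sequence \n between the character sets.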

Escaping a MULTI-LINE string literal for use as the replacement string in sed's s/// command:

# Define sample multi-line literal.
replace='Laurel & Hardy; PS\2
Masters\1 & Johnson\2'

# Escape it for use as a Sed replacement string.
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
replaceEscaped=${REPLY%$'\n'}

# If ok, outputs $replace as is.
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar"
                  

• Newlines in the input string must be retained as actual newlines, but \-escaped.
• -e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop.
• s/[&/\]/\\&/g escapes all &, \ and / instances, as in the single-line solution.
• s/\n/\\&/g then \-prefixes all actual newlines.
• IFS= read -d '' -r is used to read the sed command's output as is (to avoid the automatic removal of trailing newlines that a command substitution ($(...)) would perform).
• ${REPLY%$'\n'} then removes a single trailing newline, which the <<< has implicitly appended to the input.
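
The steps listed above can be traced on a two-line sample. This is only a sketch: the sample string and the variable name escaped are assumptions for illustration, and the `|| true` is added here because read returns a non-zero status when it reaches end of input without finding its NUL delimiter.

```bash
replace=$'A & B\nC/D'      # sample two-line replacement text

# Same commands as above; '|| true' guards against read's non-zero status at end of input.
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace") || true
escaped=${REPLY%$'\n'}

printf '%s\n' "$escaped"
# A \& B\
# C\/D

sed "s/x/$escaped/" <<<'x'
# A & B
# C/D
```

The newline is still an actual newline in the escaped value; it is merely preceded by a backslash, which is how sed expects a newline inside a replacement string.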

bash functions based on the above (for sed):

• quoteRe() quotes (escapes) for use in a regex
• quoteSubst() quotes for use in the substitution string of a s/// call.
• both handle multi-line input correctly
  • Note that because sed reads a single line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
  • Also, using command substitutions ($(...)) to call the functions won't work for strings that have trailing newlines; in that event, use something like IFS= read -d '' -r escapedValue < <(quoteSubst "$value")

# SYNOPSIS
#   quoteRe <text>
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }

# SYNOPSIS
#  quoteSubst <text>
quoteSubst() {
  IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
  printf %s "${REPLY%$'\n'}"
}
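
The caveat about command substitutions and trailing newlines can be seen in isolation. In this sketch the variable names and the sample value are assumptions for illustration, and printf stands in for a call to one of the functions:

```bash
value=$'text\n\n'                          # sample value ending in two newlines

viaSubst=$(printf %s "$value")             # $(...) strips ALL trailing newlines
# read returns non-zero at end of input (no NUL found), hence '|| true'.
IFS= read -d '' -r viaRead < <(printf %s "$value") || true

printf '%d %d\n' "${#viaSubst}" "${#viaRead}"    # 4 6
```

With read -d '' the value arrives unmodified, so any trimming (such as the single newline appended by <<<) has to be done explicitly.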
                          

Example:

from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You & I'$'\n''eating A\1 sauce.' # sample replacement string with metachars.

# Should print the unmodified value of $to
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quoteRe "$from")/$(quoteSubst "$to")/" <<<"$from"
                          

Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.
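
What the read-everything idiom changes can be shown with a substitution that targets newlines. This is a sketch; the helper name slurp_join and the sample input are assumptions for illustration:

```bash
# Without the loop, sed processes one line at a time, so \n never matches.
printf 'a\nb\nc\n' | sed 's/\n/,/g'
# a
# b
# c

# slurp_join is a hypothetical helper: the loop first appends every remaining
# line to the pattern space (N), then the substitution sees all lines at once.
slurp_join() {
  sed -e ':a' -e '$!{N;ba' -e '}' -e 's/\n/,/g'
}

printf 'a\nb\nc\n' | slurp_join
# a,b,c
```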

Perl solution:

Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or its equivalent \Q...\E quoting.
The approach is the same for both single- and multi-line strings; for example:

from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You owe me $1/$& for'$'\n''eating A\1 sauce.' # sample replacement string w/ metachars.

# Should print the unmodified value of $to.
# Note that the replacement value needs NO escaping.
perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"
                          

• Note the use of -0777 to read all input at once, so that the multi-line substitution works.
• The -s option allows placing -<var>=<val>-style Perl variable definitions following -- after the script, before any filename operands.
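
For repeated use, the perl command above could be wrapped in a shell function. The name replace_literal and the sample strings below are assumptions for illustration, not part of the original answer; the sketch reads from stdin and replaces the first literal occurrence of its first argument with its second argument:

```bash
# replace_literal is a hypothetical wrapper around the perl command shown above.
# Usage: replace_literal <from> <to>    (reads stdin, writes stdout)
replace_literal() {
  perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$1" -to="$2"
}

printf '%s\n' 'Cost (*): $3.' | replace_literal '(*)' '[$1 & \1]'
# Cost [$1 & \1]: $3.
```

Because $to is interpolated as a variable, its contents are not scanned again for $1, & or \1, which is why the replacement needs no escaping.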
