Regular expression that matches between quotes, containing escaped quotes

Question

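This was originally a question I wanted to ask, but while researching the details of the problem I found the solution and thought others might be interested.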

In Apache access logs, the full request is wrapped in double quotes, and any quotes inside it are always escaped with a backslash:

1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"

I'm trying to construct a regex that matches all the distinct fields. My current solution always stops at the first quote after the GET/POST (actually, I only need the values up to and including the size transferred):

^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"[^"]+"\s+(\d+)\s+(\d+|-)

I guess I'll also provide my solution from my PHP source, with comments and better formatting:

$sPattern = ';^' .
    # ip address: 1
    '(\d+\.\d+\.\d+\.\d+)' .
    # ident and user id
    '\s+[^\s]+\s+[^\s]+\s+' .
    # 2 day/3 month/4 year:5 hh:6 mm:7 ss +timezone
    '\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]' .
    # whitespace
    '\s+' .
    # request uri
    '"[^"]+"' .
    # whitespace
    '\s+' .
    # 8 status code
    '(\d+)' .
    # whitespace
    '\s+' .
    # 9 bytes sent
    '(\d+|-)' .
    # end of regex
    ';';

Using this with a simple case, where the URL doesn't contain other quotes, works fine:

1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"
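
For anyone who wants to reproduce this, here is a minimal sketch that runs the pattern above against both sample lines. The pattern is the one from the question, collapsed into a single string; the variable names `$simple`, `$escaped`, `$okSimple`, `$okEscaped`, and `$m` are made up for illustration.

```php
<?php
// Pattern from the question, collapsed into one string (delimiter is ';').
$sPattern = ';^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    . '\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]'
    . '\s+"[^"]+"\s+(\d+)\s+(\d+|-);';

// Nowdoc strings keep every backslash exactly as it appears in the log.
$simple = <<<'LOG'
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"
LOG;
$escaped = <<<'LOG'
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"
LOG;

$okSimple  = preg_match($sPattern, $simple, $m); // 1: groups 8 and 9 hold "400" and "299"
$okEscaped = preg_match($sPattern, $escaped);    // 0: [^"]+ cannot cross the escaped quote

var_dump($okSimple, $m[8], $m[9], $okEscaped);
```

On the second line the match fails outright: `[^"]+` ends at the first escaped quote, and the text that follows is not a status code.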

Now I'm trying to add support for zero, one, or more occurrences of \" in the request, but I can't find a solution. Using regexpal.com, I've come up with this so far:

^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*"

Here's just the changed part (each regex backslash is doubled again for the PHP single-quoted string):

    # request uri
    '"(.|\\\\(?="))*"' .

However, it's too greedy. It eats everything up to the last ", when it should only eat up to the first " not preceded by a \. I also tried introducing the requirement that there's no \ before the " I want, but it still eats to the end of the string (note: I had to add extra \ characters to make this work in PHP):

    # request uri
    '"(.|\\\\(?="))*[^\\\\]"' .
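
A small sketch of why both attempts overshoot, run against just the quoted tail of the sample line. The two patterns are the ones above, with regex backslashes doubled for the PHP string literal; the variable names are illustrative only.

```php
<?php
// Tail of the sample log line, starting at the opening quote of the request.
$tail = <<<'LOG'
"GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"
LOG;

// Regex: "(.|\\(?="))*"
$greedy = '/"(.|\\\\(?="))*"/';
// Regex: "(.|\\(?="))*[^\\]"
$greedyNoBackslash = '/"(.|\\\\(?="))*[^\\\\]"/';

preg_match($greedy, $tail, $a);
preg_match($greedyNoBackslash, $tail, $b);

// Both matches run to the very last quote on the line, so each one
// equals the whole tail instead of stopping after HTTP/1.0".
var_dump($a[0] === $tail, $b[0] === $tail);
```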

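But then it hit me: *?. If ? is used immediately after any of the quantifiers *, +, ?, or {}, it makes the quantifier non-greedy (matching the minimum number of times):

    # request uri
    '"(.|\\\\(?="))*?[^\\\\]"' .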

The complete regex:

^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*?[^\\]"\s+(\d+)\s+(\d+|-)
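
Here is a hedged sketch of the complete regex in PHP, run on the sample line with escaped quotes. One thing to watch, as far as I can tell: the new `(.|\\(?="))` group is a capturing group, so the status code and byte count move from groups 8 and 9 to groups 9 and 10. Variable names are illustrative.

```php
<?php
// Complete regex from above; regex backslashes in the request part are
// doubled again because this is a PHP single-quoted string.
$sPattern = ';^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    . '\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+'
    . '"(.|\\\\(?="))*?[^\\\\]"'
    . '\s+(\d+)\s+(\d+|-);';

$escaped = <<<'LOG'
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"
LOG;

$ok = preg_match($sPattern, $escaped, $m);

// Group 8 is now the repeated request group, so status and bytes shift by one.
var_dump($ok, $m[9], $m[10]);
```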

Update, 5 May 2009:

After parsing millions of lines, I discovered a small flaw in the regex: it breaks on lines that contain a backslash character right before the closing double quote. In other words:

...\\"

will break the regex. Apache never logs a bare ...\" at the end of the request; it always escapes a backslash as \\, so when two backslash characters precede the double quote, it's safe to assume that quote is the real closing quote.
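
To make the flaw concrete, here is a minimal sketch using a made-up log line whose request ends in an escaped backslash (`\\`) right before the closing quote. The line is hypothetical, not taken from a real log, and the variable names are illustrative.

```php
<?php
// Complete regex with the non-greedy request part, as a PHP string.
$sPattern = ';^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    . '\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+'
    . '"(.|\\\\(?="))*?[^\\\\]"'
    . '\s+(\d+)\s+(\d+|-);';

// Hypothetical line: the request ends with an escaped backslash (two characters).
$trailing = <<<'LOG'
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /foo\\" 400 299 "-" "-" "-"
LOG;

// [^\\]" demands a non-backslash right before the closing quote,
// which this line cannot satisfy, so nothing matches.
$ok = preg_match($sPattern, $trailing);

var_dump($ok);
```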

Does anyone have an idea how to fix this with the regex?

Helpful resources: the JavaScript RegExp documentation at developer.mozilla.org, and regexpal.com

Recommended answer

Try this:

"(?:[^\\"]+|\\.)*"

This regular expression matches a double quote character, followed by any sequence of either characters other than \ and " or an escape sequence \α (where α can be any character), followed by the closing double quote character. The (?:expr) syntax is just a non-capturing group.
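
As a sketch of how this could slot into the full pattern from the question, the helper below (the name `parseLogLine` is hypothetical) swaps the request part for the regex above. Because the group is non-capturing, the status code and byte count stay in groups 8 and 9. The third sample line, with `\\` before the closing quote, is made up for illustration.

```php
<?php
/**
 * Returns [status, bytes] for one access-log line, or null when it does not match.
 * parseLogLine is a hypothetical helper, not part of the original question.
 */
function parseLogLine(string $line): ?array
{
    $pattern = ';^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
        . '\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+'
        // Regex: "(?:[^\\"]+|\\.)*"  (backslashes doubled for the PHP literal)
        . '"(?:[^\\\\"]+|\\\\.)*"'
        . '\s+(\d+)\s+(\d+|-);';

    return preg_match($pattern, $line, $m) === 1 ? [$m[8], $m[9]] : null;
}

$simple = <<<'LOG'
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"
LOG;
$escaped = <<<'LOG'
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"
LOG;
$trailing = <<<'LOG'
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /foo\\" 400 299 "-" "-" "-"
LOG;

foreach ([$simple, $escaped, $trailing] as $line) {
    var_dump(parseLogLine($line)); // each call yields ["400", "299"]
}
```

The key point is that `\\.` always consumes a backslash together with the character after it, so an escaped quote or an escaped backslash can never be mistaken for the closing quote.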
