Regular expression that matches between quotes, containing escaped quotes
Question
This started out as a question I wanted to ask, but while researching the details I found a solution and thought others might be interested.
In Apache, the full request is enclosed in double quotes, and any quotes inside it are always escaped with a backslash:
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"
I'm trying to construct a regex which matches all distinct fields. My current solution always stops at the first quote after the GET/POST (actually I only need the values up to and including the size transferred):
^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"[^"]+"\s+(\d+)\s+(\d+|-)
Here is the same pattern from my PHP source, with comments and better formatting:
$sPattern = ';^' .
# ip address: 1
'(\d+\.\d+\.\d+\.\d+)' .
# ident and user id
'\s+[^\s]+\s+[^\s]+\s+' .
# 2 day/3 month/4 year:5 hh:6 mm:7 ss +timezone
'\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]' .
# whitespace
'\s+' .
# request uri
'"[^"]+"' .
# whitespace
'\s+' .
# 8 status code
'(\d+)' .
# whitespace
'\s+' .
# 9 bytes sent
'(\d+|-)' .
# end of regex
';';
This works fine in the simple case where the URL doesn't contain other quotes:
1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"
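To make the failure concrete, here is a minimal sketch that runs the pattern above against both sample lines. It is written in Python rather than PHP so it can be checked in isolation; the names `ORIGINAL`, `simple_line` and `escaped_line` are made up for this illustration, and the pattern itself is copied unchanged.

```python
import re

# The question's original pattern; the request field is "[^"]+" and is not captured.
ORIGINAL = re.compile(
    r'^(\d+\.\d+\.\d+\.\d+)'                                  # 1: IP address
    r'\s+[^\s]+\s+[^\s]+\s+'                                  # ident and user id
    r'\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]'  # 2-7: date and time
    r'\s+"[^"]+"'                                             # request
    r'\s+(\d+)'                                               # 8: status code
    r'\s+(\d+|-)'                                             # 9: bytes sent
)

# Both sample lines are taken from the question.
simple_line = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\ foo=bat\ HTTP/1.0" 400 299 "-" "-" "-"'
escaped_line = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"'

simple_match = ORIGINAL.match(simple_line)
escaped_match = ORIGINAL.match(escaped_line)

print(simple_match.group(8), simple_match.group(9))  # 400 299
print(escaped_match)                                 # None
```

On the second line, `[^"]+` ends at the first escaped quote, and the text that follows it is not whitespace plus a status code, so the whole match fails.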
Now I'm trying to add support for zero, one or more occurrences of \" in the request, but can't find a solution. Using regexpal.com I've come up with this so far:
^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*"
Here is just the changed part:
# request uri
'"(.|\\(?="))*"' .
However, it's too greedy. It eats everything up to the last ", when it should only eat up to the first " not preceded by a \. I also tried requiring that there is no \ before the " I want, but it still eats to the end of the string (note: I had to add extra \ characters to make this work in PHP):
# request uri
'"(.|\\(?="))*[^\\\\]"' .
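But then it hit me: *?. If used immediately after any of the quantifiers *, +, ?, or {}, the ? makes the quantifier non-greedy (matching the minimum number of times):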
# request uri
'"(.|\\(?="))*?[^\\\\]"' .
The complete regular expression:
^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+"(.|\\(?="))*?[^\\]"\s+(\d+)\s+(\d+|-)
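The difference between the greedy and non-greedy attempts can be seen by isolating just the request-field part of the pattern. The sketch below is in Python rather than the question's PHP, and using `re.search` on the fragment alone is an assumption made for illustration; the sample line is the one from the question.

```python
import re

# Sample log line from the question (request contains two escaped quotes).
line = r'1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] "GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"'

# Greedy attempt: * grabs as much as possible, so the match runs to the last quote.
greedy = re.search(r'"(.|\\(?="))*"', line)

# Non-greedy attempt: *? grows one character at a time and stops at the
# first quote that is preceded by something other than a backslash.
lazy = re.search(r'"(.|\\(?="))*?[^\\]"', line)

print(greedy.group(0))  # from the opening quote to the end of the line
print(lazy.group(0))    # "GET /\" foo=bat\" HTTP/1.0"
```

Note that the extra capturing group in this part of the pattern shifts the status code and bytes sent to groups 9 and 10 in the complete expression.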
Update, 5 May 2009:
While parsing millions of lines I discovered a small flaw in the regexp: it breaks on lines which contain a backslash character right before the closing double quote. In other words:
...\\"
will break the regex. Apache will not log ...\" for a literal backslash but will always escape the backslash to \\, so when there are two backslash characters before a double quote it's safe to assume they are an escaped backslash and the quote is not escaped.
Does anyone have an idea how to fix this in the regex?
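The flaw can be reproduced with a short sketch, again in Python instead of PHP. The first line is the sample from the question; the second line, whose request ends in an escaped backslash, is a hypothetical example constructed for this illustration.

```python
import re

# The complete non-greedy expression from the question, unchanged.
NON_GREEDY = re.compile(
    r'^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'
    r'\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+'
    r'"(.|\\(?="))*?[^\\]"\s+(\d+)\s+(\d+|-)'
)

prefix = '1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] '
# Sample from the question: escaped quotes inside the request.
escaped_quotes = prefix + r'"GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"'
# Hypothetical line: the request ends in a literal backslash, logged as \\
trailing_backslash = prefix + r'"GET /foo\\" 400 299 "-" "-" "-"'

ok = NON_GREEDY.match(escaped_quotes)
broken = NON_GREEDY.match(trailing_backslash)

print(ok.group(9), ok.group(10))  # 400 299
print(broken)                     # None
```

The second line fails because `[^\\]"` demands a non-backslash character directly before the closing quote, and no later quote in the line is followed by a status code.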
Useful resources: the JavaScript RegExp documentation on developer.mozilla.org, and regexpal.com.
Recommended answer
Try this:
"(?:[^\\"]+|\\.)*"
This regular expression matches a double quote character, followed by a sequence of either any characters other than \ and " or an escape sequence \α (where α can be any character), followed by the final double quote character. The (?:expr) syntax is just a non-capturing group.
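As a rough check, the answer's pattern can be dropped into the question's full expression in place of the request-field part. This is a sketch in Python rather than PHP; the extra capturing group around the request and the second log line (a request ending in an escaped backslash) are assumptions added for illustration, not part of the original answer.

```python
import re

FIXED = re.compile(
    r'^(\d+\.\d+\.\d+\.\d+)\s+[^\s]+\s+[^\s]+\s+'                # 1: IP address
    r'\[(\d+)/([A-Za-z]+)/(\d+):(\d+):(\d+):(\d+)\s+\+\d+\]\s+'  # 2-7: date and time
    r'"((?:[^\\"]+|\\.)*)"'   # 8: request (capture added for illustration)
    r'\s+(\d+)\s+(\d+|-)'     # 9: status code, 10: bytes sent
)

prefix = '1.2.3.4 - - [15/Apr/2005:20:35:37 +0200] '
lines = [
    # Sample from the question: escaped quotes inside the request.
    prefix + r'"GET /\" foo=bat\" HTTP/1.0" 400 299 "-" "-" "-"',
    # Hypothetical line: request ends in a literal backslash, logged as \\
    prefix + r'"GET /foo\\" 400 299 "-" "-" "-"',
]

results = [FIXED.match(line) for line in lines]
for m in results:
    # Print the request, status code and bytes sent for each line.
    print(m.group(8), m.group(9), m.group(10))
```

Because every backslash is consumed together with the character after it, an escaped quote can never be mistaken for the closing quote, and an escaped backslash directly before the closing quote no longer breaks the match.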