Elisp mechanism for converting PCRE regexps to Emacs regexps


Question

I admit to a significant bias: I like PCRE regexps much better than Emacs regexps, if for no other reason than that when I type a '(' I pretty much always want a grouping operator. And, of course, \w and similar are SO much more convenient than the other equivalents.

But it would be crazy to expect to change the internals of Emacs, of course. It should be possible, though, to convert a PCRE expression to an Emacs expression, I'd think, and do all the needed conversions so I can write:

(defun my-super-regexp-function ...
   (re-search-forward (pcre-convert "__\\w: \\d+")))

(or similar).

Does anyone know of an Elisp library that can do this?
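For comparison, here is a minimal sketch of the same search written by hand in Emacs syntax. The sample text is made up for illustration. Emacs has no `\d`, so the digit class is spelled out, and every backslash is doubled inside the Lisp string.

```elisp
;; Hand-written Emacs equivalent of the PCRE pattern  __\w: \d+
;; Emacs regexps have no \d, so the digit class becomes [0-9].
(let ((emacs-re "__\\w: [0-9]+")
      (sample "id __x: 42 end"))        ; sample text is an assumption
  (when (string-match emacs-re sample)
    (match-string 0 sample)))
;; → "__x: 42"
```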


Edit: Selecting a response from the answers below...

Wow, I love coming back from 4 days of vacation to find a slew of interesting answers to sort through! I love the work that went into the solutions of both types.

In the end, it looks like both the exec-a-script and straight elisp versions of the solutions would work, but from a pure speed and "correctness" standpoint the elisp version is certainly the one that people would prefer (myself included).

Solution

Here's a quick and ugly Emacs Lisp solution (EDIT: now located more permanently here). It's based mostly on the description in the pcrepattern man page, and works token by token, converting only the following constructions:

  • parenthesis grouping ( .. )
  • alternation |
  • numerical repeats {M,N}
  • string quoting \Q .. \E
  • simple character escapes: \a, \c, \e, \f, \n, \r, \t, \x, and \ + octal digits
  • character classes: \d, \D, \h, \H, \s, \S, \v, \V
  • \w and \W left as they are (using Emacs' own idea of word and non-word characters)
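To make the first few items concrete, the sketch below writes the Emacs equivalents by hand and checks them with built-in functions only; it does not call the converter. The test strings are made up for illustration.

```elisp
;; PCRE:  (ab|cd){2,3}      Emacs:  \(ab\|cd\)\{2,3\}
;; Grouping, alternation and counted repeats all need a backslash in Emacs.
(let ((emacs-re "\\(ab\\|cd\\)\\{2,3\\}"))
  (list (string-match-p emacs-re "abcd")   ; two repetitions: match at index 0
        (string-match-p emacs-re "ab")))   ; only one repetition: no match
;; → (0 nil)

;; PCRE:  \Qa.b*\E          Emacs: quote the literal text with `regexp-quote'.
(regexp-quote "a.b*")
;; → "a\\.b\\*"
```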

It doesn't do anything with more complicated PCRE assertions, but it does try to convert escapes inside character classes. In the case of character classes including something like \D, this is done by converting into a non-capturing group with alternation.
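As an illustration of that last point, a PCRE class such as `[a-c\D]` can be expressed in Emacs as a shy group that alternates between the ordinary members and the negated digit class. The regexp below is written by hand to show the target form, and the anchors `\`` and `\'` are added only so each one-character test string must match in full.

```elisp
;; PCRE:  [a-c\D]     Emacs (by hand):  \(?:[a-c]\|[^0-9]\)
(let ((emacs-re "\\`\\(?:[a-c]\\|[^0-9]\\)\\'"))
  (mapcar (lambda (s) (and (string-match-p emacs-re s) t))
          '("b" "z" "7")))
;; → (t t nil)
```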

It passes the tests I wrote for it, but there are certainly bugs, and the method of scanning token-by-token is probably slow. In other words, no warranty. But perhaps it will do enough of the simpler part of the job for some purposes. Interested parties are invited to improve it ;-)
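The token-by-token idea can be sketched in a few lines. The function below is a stripped-down illustration, not the code that follows: the name `my-tiny-pcre-scan` is hypothetical, it rewrites only `(`, `)` and `|`, and it ignores backslash escapes entirely.

```elisp
(defun my-tiny-pcre-scan (pcre)
  "Translate only ( ) | in PCRE and copy everything else. Illustrative only."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((accum '()))
      (while (not (eobp))
        (push (cond ((looking-at "[(|)]")             ; token needing a backslash
                     (concat "\\" (match-string 0)))
                    (t (looking-at "\\(?:.\\|\n\\)")  ; any single character
                       (match-string 0)))
              accum)
        ;; Step over whichever token was just matched.
        (goto-char (match-end 0)))
      (apply #'concat (nreverse accum)))))

(my-tiny-pcre-scan "(foo|bar)+")
;; → "\\(foo\\|bar\\)+"
```

Each pass through the loop runs `looking-at` once or twice at point, which is why scanning this way is simple but not especially fast.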

(eval-when-compile (require 'cl))

(defvar pcre-horizontal-whitespace-chars
  (mapconcat 'char-to-string
             '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003
                      #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F
                      #x205F #x3000)
             ""))

(defvar pcre-vertical-whitespace-chars
  (mapconcat 'char-to-string
             '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) ""))

(defvar pcre-whitespace-chars
  (mapconcat 'char-to-string '(9 10 12 13 32) ""))

(defvar pcre-horizontal-whitespace
  (concat "[" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-non-horizontal-whitespace
  (concat "[^" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-vertical-whitespace
  (concat "[" pcre-vertical-whitespace-chars "]"))

(defvar pcre-non-vertical-whitespace
  (concat "[^" pcre-vertical-whitespace-chars "]"))

(defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]"))

(defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]"))

(eval-when-compile
  (defmacro pcre-token-case (&rest cases)
    "Consume a token at point and evaluate corresponding forms.

CASES is a list of `cond'-like clauses, (REGEXP FORMS
...). Considering CASES in order, if the text at point matches
REGEXP then moves point over the matched string and returns the
value of FORMS. Returns `nil' if none of the CASES matches."
    (declare (debug (&rest (sexp &rest form))))
    `(cond
      ,@(mapcar
         (lambda (case)
           (let ((token (car case))
                 (action (cdr case)))
             `((looking-at ,token)
               (goto-char (match-end 0))
               ,@action)))
         cases)
      (t nil))))

(defun pcre-to-elisp (pcre)
  "Convert PCRE, a regexp in PCRE notation, into Elisp string form."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((capture-count 0) (accum '())
          (case-fold-search nil))
      (while (not (eobp))
        (let ((translated
               (or
                ;; Handle tokens that are treated the same in
                ;; character classes
                (pcre-re-or-class-token-to-elisp)   

                ;; Other tokens
                (pcre-token-case
                 ("|" "\\|")
                 ("(" (incf capture-count) "\\(")
                 (")" "\\)")
                 ("{" "\\{")
                 ("}" "\\}")

                 ;; Character class
                 ("\\[" (pcre-char-class-to-elisp))

                 ;; Backslash + digits => backreference or octal char?
                 ("\\\\\\([0-9]+\\)"
                  (let* ((digits (match-string 1))
                         (dec (string-to-number digits)))
                    ;; from "man pcrepattern": If the number is
                    ;; less than 10, or if there have been at
                    ;; least that many previous capturing left
                    ;; parentheses in the expression, the entire
                    ;; sequence is taken as a back reference.   
                    (cond ((< dec 10) (concat "\\" digits))
                          ((>= capture-count dec)
                           (error "backreference \\%s can't be used in Emacs regexps"
                                  digits))
                          (t
                           ;; from "man pcrepattern": if the
                           ;; decimal number is greater than 9 and
                           ;; there have not been that many
                           ;; capturing subpatterns, PCRE re-reads
                           ;; up to three octal digits following
                           ;; the backslash, and uses them to
                           ;; generate a data character. Any
                           ;; subsequent digits stand for
                           ;; themselves.
                           (goto-char (match-beginning 1))
                           (re-search-forward "[0-7]\\{0,3\\}")
                           (char-to-string (string-to-number (match-string 0) 8))))))

                 ;; Regexp quoting.
                 ("\\\\Q"
                  (let ((beginning (point)))
                    (search-forward "\\E")
                    (regexp-quote (buffer-substring beginning (match-beginning 0)))))

                 ;; Various character classes
                 ("\\\\d" "[0-9]")
                 ("\\\\D" "[^0-9]")
                 ("\\\\h" pcre-horizontal-whitespace)
                 ("\\\\H" pcre-non-horizontal-whitespace)
                 ("\\\\s" pcre-whitespace)
                 ("\\\\S" pcre-non-whitespace)
                 ("\\\\v" pcre-vertical-whitespace)
                 ("\\\\V" pcre-non-vertical-whitespace)

                 ;; Use Emacs' native notion of word characters
                 ("\\\\[Ww]" (match-string 0))

                 ;; Any other escaped character
                 ("\\\\\\(.\\)" (regexp-quote (match-string 1)))

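                 ;; Any normal character (including a newline, which "."
                 ;; alone would not match, leaving point stuck forever)
                 ("\\(?:.\\|\n\\)" (match-string 0))))))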
          (push translated accum)))
      (apply 'concat (reverse accum)))))

(defun pcre-re-or-class-token-to-elisp ()
  "Consume the PCRE token at point and return its Elisp equivalent.

Handles only tokens which have the same meaning in character
classes as outside them."
  (pcre-token-case
   ("\\\\a" (char-to-string #x07))  ; bell
   ("\\\\c\\(.\\)"                  ; control character
    (char-to-string
     (- (string-to-char (upcase (match-string 1))) 64)))
   ("\\\\e" (char-to-string #x1b))  ; escape
   ("\\\\f" (char-to-string #x0c))  ; formfeed
   ("\\\\n" (char-to-string #x0a))  ; linefeed
   ("\\\\r" (char-to-string #x0d))  ; carriage return
   ("\\\\t" (char-to-string #x09))  ; tab
   ("\\\\x\\([0-9A-Fa-f]\\{2\\}\\)"   ; \xHH, exactly two hex digits
    (char-to-string (string-to-number (match-string 1) 16)))
   ("\\\\x{\\([0-9A-Fa-f]*\\)}"       ; \x{HHH..}, braced hex code point
    (char-to-string (string-to-number (match-string 1) 16)))))

(defun pcre-char-class-to-elisp ()
  "Consume the remaining PCRE character class at point and return its Elisp equivalent.

Point should be after the opening \"[\" when this is called, and
will be just after the closing \"]\" when it returns."
  (let ((accum '("["))
        (pcre-char-class-alternatives '())
        (negated nil))
    (when (looking-at "\\^")
      (setq negated t)
      (push "^" accum)
      (forward-char))
    (when (looking-at "\\]") (push "]" accum) (forward-char))

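    (while (not (looking-at "\\]"))
      ;; Without this check an unterminated class would loop forever.
      (when (eobp) (error "Unterminated character class in PCRE"))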
      (let ((translated
             (or
              (pcre-re-or-class-token-to-elisp)
              (pcre-token-case              
               ;; Backslash + digits => always an octal char
               ("\\\\\\([0-7]\\{1,3\\}\\)"    
                (char-to-string (string-to-number (match-string 1) 8)))

               ;; Various character classes. To implement negative char classes,
               ;; we cons them onto the list `pcre-char-class-alternatives' and
               ;; transform the char class into a shy group with alternation
               ("\\\\d" "0-9")
               ("\\\\D" (push (if negated "[0-9]" "[^0-9]")
                              pcre-char-class-alternatives) "")
               ("\\\\h" pcre-horizontal-whitespace-chars)
               ("\\\\H" (push (if negated
                                  pcre-horizontal-whitespace
                                pcre-non-horizontal-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\s" pcre-whitespace-chars)
               ("\\\\S" (push (if negated
                                  pcre-whitespace
                                pcre-non-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\v" pcre-vertical-whitespace-chars)
               ("\\\\V" (push (if negated
                                  pcre-vertical-whitespace
                                pcre-non-vertical-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\w" (push (if negated "\\W" "\\w") 
                              pcre-char-class-alternatives) "")
               ("\\\\W" (push (if negated "\\w" "\\W") 
                              pcre-char-class-alternatives) "")

               ;; Leave POSIX syntax unchanged
               ("\\[:[a-z]*:\\]" (match-string 0))

               ;; Ignore other escapes
               ("\\\\\\(.\\)" (match-string 0))

               ;; Copy everything else (including a newline, which "."
               ;; alone would not match)
               ("\\(?:.\\|\n\\)" (match-string 0))))))
        (push translated accum)))
    (push "]" accum)
    (forward-char)
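    (let ((class
           (apply 'concat (reverse accum))))
      ;; A class left with no members of its own ("[]" or "[^]") must be
      ;; dropped entirely; keeping it as "" would add an empty alternative
      ;; that matches the empty string.
      (when (or (equal class "[]")
                (equal class "[^]"))
        (setq class nil))
      (if (not pcre-char-class-alternatives)
          (or class "")
        (concat "\\(?:"
                (mapconcat 'identity
                           (if class
                               (cons class pcre-char-class-alternatives)
                             pcre-char-class-alternatives)
                           "\\|")
                "\\)")))))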

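A short usage sketch, assuming the definitions above have been evaluated in the current session. The sample buffer text is made up for illustration, and the whole example is wrapped in an `fboundp` check so that it does nothing if `pcre-to-elisp` is not loaded.

```elisp
(when (fboundp 'pcre-to-elisp)
  (list
   (pcre-to-elisp "(foo|bar)\\d+")      ; → "\\(foo\\|bar\\)[0-9]+"
   (pcre-to-elisp "\\Qa.b\\E")          ; → "a\\.b"
   ;; The result is a regexp, so it goes to `re-search-forward',
   ;; not to `search-forward', which treats its argument literally.
   (with-temp-buffer
     (insert "id __x: 42 end")          ; sample text is an assumption
     (goto-char (point-min))
     (when (re-search-forward (pcre-to-elisp "__\\w: \\d+") nil t)
       (match-string 0)))))             ; → "__x: 42"
```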