Correctly parsing string literals with Python's re module


Question

I'm trying to add some light markdown support for a javascript preprocessor which I'm writing in Python.

For the most part it's working, but sometimes the regex I'm using is acting a little odd, and I think it's got something to do with raw-strings and escape sequences.

The regex is: (?<!\\)\"[^\"]+\"

Yes, I am aware that it only matches strings beginning with a " character. However, this project is born out of curiosity more than anything, so I can live with it for now.

To break it down:

(?<\\)\"    # The group should begin with a quotation mark that is not escaped
[^\"]+      # and match any number of at least one character that is not a quotation mark (this is the biggest problem, I know)
\"          # and end at the first quotation mark it finds

That being said, I (obviously) start hitting problems with things like this:

"This is a string with an \"escaped quote\" inside it"
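To make the failure concrete, here is a minimal sketch that runs the pattern above against that sample. The variable names are only illustrative.

```python
import re

# The pattern from the question, written as a raw string so the
# backslashes reach the regex engine unchanged.
ORIGINAL = r'(?<!\\)"[^"]+"'

sample = r'"This is a string with an \"escaped quote\" inside it"'

matches = re.findall(ORIGINAL, sample)
for m in matches:
    print(m)
# prints: "This is a string with an \"
```

The match is cut short because [^"]+ happily consumes the backslash, and the quote that follows it is then taken as the closing quote.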

I'm not really sure how to say "Everything but a quotation mark, unless that mark is escaped". I tried:

([^\"]|\\\")+     # a group of anything but a quote or an escaped quote

, but that led to very strange results.
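One plausible explanation for the strange results, sketched under the assumption that the group was dropped into the original pattern in place of [^\"]+ and used with re.findall: when a pattern contains a capturing group, findall returns the group rather than the whole match, and a repeated group keeps only its last iteration. In addition, [^"] is tried first and also matches a backslash, so the \\" alternative never gets a chance.

```python
import re

# Assumption: the attempted group replaces [^"]+ inside the original pattern.
ATTEMPT = r'(?<!\\)"([^"]|\\")+"'

sample = r'"This is a string with an \"escaped quote\" inside it"'

# findall returns the capturing group, and a repeated group only
# remembers the last thing it matched.
captured = re.findall(ATTEMPT, sample)
print(captured)   # ['\\'], i.e. a single backslash

# finditer exposes the whole match, which still stops at the escaped quote
# because [^"] consumed the backslash before \\" could be tried.
whole = [m.group(0) for m in re.finditer(ATTEMPT, sample)]
print(whole[0])   # "This is a string with an \"
```

Switching to a non-capturing group, (?:...), avoids the first effect; the second needs the backslash excluded from the character class.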

I'm fully prepared to hear that I'm going about this all wrong. For the sake of simplicity, let's say that this regex will always start and end with double quotes (") to avoid adding another element in the mix. I really want to understand what I have so far.

Thanks for any assistance.

EDIT

As a test for the regex, I'm trying to find all string literals in the minified jQuery script with the following code (using unutbu's pattern below):

import re

STRLIT = r'''(?x)   # verbose mode
    (?<!\\)    # not preceded by a backslash
    "          # a literal double-quote
    .*?        # non-greedy 0-or-more characters
    (?<!\\)    # not preceded by a backslash
    "          # a literal double-quote
    '''

# Read the whole minified script, then collect every match
with open("jquery.min.js", "r") as f:
    jq = f.read()
literals = re.findall(STRLIT, jq)

The answer below fixes almost all of the issues. The ones that remain occur inside jQuery's own regular expressions, which is very much an edge case. The solution no longer misidentifies valid JavaScript as markdown links, which was really the goal.

Solution

Perhaps use two negative lookbehinds:

import re

text = r'''"This is a string with an \"escaped quote\" inside it". While ""===r?+r:wt.test(r)?st.parseJSON(r)    :r}catch(o){}st.data(e,n,r)}else r=t}return r}function s(e){var t;for(t in e)if(("data" '''

for match in (re.findall(r'''(?x)   # verbose mode
    (?<!\\)    # not preceded by a backslash
    "          # a literal double-quote
    .*?        # non-greedy 0-or-more characters
    (?<!\\)    # not preceded by a backslash
    "          # a literal double-quote
    ''', text)):
    print(match)

yields

"This is a string with an \"escaped quote\" inside it"
""
"data"

The question mark in .*? makes the pattern non-greedy. Because of this, the match ends at the first double quotation mark that is not preceded by a backslash.
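One case the lookbehind approach does not cover is a string that ends in an escaped backslash, such as "path\\", where the closing quote is preceded by a backslash but is not itself escaped. A common alternative, shown here only as a sketch and not as the pattern from the answer above, is to consume each backslash together with the character it escapes. The names STRING_LITERAL and find_string_literals are hypothetical.

```python
import re

# Hypothetical alternative: walk the string body one unit at a time,
# where a unit is either an ordinary character or a backslash escape.
STRING_LITERAL = re.compile(r'''(?x)
    "              # opening double-quote
    (?:
        [^"\\]     # any character except a quote or a backslash
      | \\.        # or a backslash together with the character it escapes
    )*
    "              # closing double-quote
    ''')

# The lookbehind pattern from the answer, on one line, for comparison.
LOOKBEHIND = r'(?<!\\)".*?(?<!\\)"'


def find_string_literals(source):
    """Return every double-quoted literal found in source."""
    return STRING_LITERAL.findall(source)


tricky = r'"path\\" + "x"'
print(re.findall(LOOKBEHIND, tricky))   # one over-long match: "path\\" + "
print(find_string_literals(tricky))     # two matches: "path\\" and "x"
```

This sketch does not inspect what comes before the opening quote, and like the lookbehind version it knows nothing about single-quoted strings, comments, or JavaScript regex literals, so the edge cases inside jQuery's own regular expressions would still need separate handling.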
