Python regex to match text in single quotes, ignoring escaped quotes (and tabs/newlines)

Question

Given a text file in which the text I want to match is delimited by single quotes, but may contain zero or one escaped single quotes as well as zero or more unescaped tab and newline characters, I want to match the text only. Example:

menu_item = 'casserole';
menu_item = 'meat 
            loaf';
menu_item = 'Tony\'s magic pizza';
menu_item = 'hamburger';
menu_item = 'Dave\'s famous pizza';
menu_item = 'Dave\'s lesser-known
    gyro';

I want to grab only the text (and spaces), ignoring the tabs and newlines. I don't care whether the escaped quote appears in the results, as long as it doesn't affect the match:

casserole
meat loaf
Tonys magic pizza
hamburger
Daves famous pizza
Dave\'s lesser-known gyro # quote is okay if necessary.

I have managed to create a regex that almost does it. It handles the escaped quotes, but not the newlines:

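import re

# inFP is the already-opened input file
menuPat = r"menu_item = \'(.*)(\\\')?(\t|\n)*(.*)\'"
for line in inFP.readlines():  # one line at a time, so multi-line items never match
    m = re.search(menuPat, line)
    if m is not None:
        print(m.group())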

There are plenty of regular expression questions out there, but most use Perl, and if one of them does what I want, I couldn't figure it out :) Since I'm using Python, I don't mind if the match is spread across multiple groups; they are easy to recombine.

Some answers have said to just parse the text with code. I'm sure I could do that, but I'm so close to having a working regex :) and it seems like it should be doable.

Update: I just realized that I am using Python's readlines() to get each line, which obviously breaks up the multi-line items before they reach the regex. I'm looking at rewriting that part, but any suggestions on it would also be very helpful.

Recommended answer

This should do it:

menu_item = '((?:[^'\\]|\\')*)'

Here the (?:[^'\\]|\\')* part matches any sequence in which each element is either a single character other than ' and \, or a literal \'. The first alternative, [^'\\], also matches line breaks and tabs, which you then need to replace with a single space.
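As a rough sketch of how this pattern might be used in Python: the snippet below applies it to the whole text at once instead of line by line, so that items spanning several lines can match, then collapses each run of whitespace into one space. The names MENU_PAT, extract_menu_items and sample are made up for illustration, and the sample string stands in for the file contents, which would normally come from something like inFP.read().

```python
import re

# Pattern from the answer: capture everything between the quotes, where the
# body is any run of characters other than ' and \, or the two-character
# sequence \'.
MENU_PAT = re.compile(r"menu_item = '((?:[^'\\]|\\')*)'")


def extract_menu_items(text):
    """Return the quoted menu items found in text (hypothetical helper)."""
    items = []
    for raw in MENU_PAT.findall(text):
        # Collapse each run of whitespace (tabs, newlines, spaces) to one space.
        items.append(re.sub(r"\s+", " ", raw))
    return items


# Stand-in for the file contents; with a real file, pass inFP.read() instead
# of iterating over inFP.readlines().
sample = (
    "menu_item = 'casserole';\n"
    "menu_item = 'meat \n"
    "            loaf';\n"
    "menu_item = 'Tony\\'s magic pizza';\n"
    "menu_item = 'Dave\\'s lesser-known\n"
    "    gyro';\n"
)

for item in extract_menu_items(sample):
    print(item)
```

The escaped quote stays in the result as \', which the question says is acceptable. If it should be dropped, a further item.replace("\\'", "") on each result would be one way to do it.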
