Python Regex doesn't work as expected

Problem description

I crafted this regex:

<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>

to parse the following RSS feed:

<?xml version="1.0" encoding="UTF-8"?>\n<feed version="0.3" xmlns="http://purl.org/atom/ns#">\n<title>Gmail - Inbox for g.bargelli@gmail.com</title>\n<tagline>New messages in your Gmail Inbox</tagline>\n<fullcount>2</fullcount>\n<link rel="alternate" href="http://mail.google.com/mail" type="text/html" />\n<modified>2011-03-15T11:07:48Z</modified>\n<entry>\n<title>con due mail...</title>\n<summary>Gianluca Bargelli http://about.me/proudlygeek/bio</summary>\n<link rel="alternate" href="http://mail.google.com/mail?account_id=g.bargelli@gmail.com&amp;message_id=12eb9332c2c1fa27&amp;view=conv&amp;extsrc=atom" type="text/html" />\n<modified>2011-03-15T11:07:42Z</modified>\n<issued>2011-03-15T11:07:42Z</issued>\n<id>tag:gmail.google.com,2004:1363345158434847271</id>\n<author>\n<name>me</name>\n<email>g.bargelli@gmail.com</email>\n</author>\n</entry>\n<entry>\n<title>test nuova mail</title>\n<summary>Gianluca Bargelli sono tornato!?& http://about.me/proudlygeek/bio</summary>\n<link rel="alternate" href="http://mail.google.com/mail?account_id=g.bargelli@gmail.com&amp;message_id=12eb93140d9f7627&amp;view=conv&amp;extsrc=atom" type="text/html" />\n<modified>2011-03-15T11:05:36Z</modified>\n<issued>2011-03-15T11:05:36Z</issued>\n<id>tag:gmail.google.com,2004:1363345026546890279</id>\n<author>\n<name>me</name>\n<email>g.bargelli@gmail.com</email>\n</author>\n</entry>\n</feed>\n

The problem is that I am not getting any matches when using Python's re module:

import re

regex = re.compile("""<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>""")
regex.findall(rss_string) # Returns an empty list

In an online regex tester the pattern works as expected, so I don't think the regex itself is the problem.

I am well aware that using regular expressions to parse a context-free grammar is bad, but in my case the regular expression only needs to work for this one RSS feed (a Gmail inbox feed, by the way). I know I could use an external library or XML parser for this task; this is only an exercise, not a habit.

So the question is: why doesn't the regex above work as expected in Python?

Recommended answer

Before the regex compiler sees the string, Python has already processed the backslash escapes in the string literal. In a normal literal, \\n reaches the regex engine as \n, which it treats as a newline character rather than as a backslash followed by the letter n. To match the literal two-character sequence you would have to escape it twice (e.g. \\\\n for \\n). However, Python has a handy notation for exactly this sort of thing: put an r before the string to make it a raw string literal:

regex = re.compile(r"""<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>""")
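The difference can be checked on a small input. The following is only a sketch: the `sample` string is a hypothetical, shortened stand-in for the feed, and it assumes (as in the question) that the text contains the two characters backslash and n rather than real newlines. The non-raw pattern also doubles the backslash in `\\w` to avoid invalid-escape warnings on newer Python versions.

```python
import re

# Hypothetical sample: contains a literal backslash followed by "n",
# not real newline characters (the r prefix keeps the backslashes).
sample = r"<entry>\n<title>hello</title>\n</entry>"

# Normal literal: Python turns "\\n" into backslash + n, which the regex
# engine then interprets as the escape for a newline character.
plain = re.compile("<entry>\\n<(\\w+)>(.+?)</\\w+>\\n</entry>")

# Raw literal: the regex engine receives \\n, i.e. an escaped backslash
# followed by the letter n, which matches the text in the sample.
raw = re.compile(r"<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>")

plain_matches = plain.findall(sample)
raw_matches = raw.findall(sample)

print(plain_matches)  # []
print(raw_matches)    # [('title', 'hello')]
```

If the feed instead contained real newline characters, the situation would be reversed and the non-raw pattern would be the one that matches.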

By the way, I agree with the others here: do not use regexes to parse XML. Still, hopefully you will find raw string notation helpful in future regular expressions.
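For comparison, here is a minimal sketch of the parser route using the standard library's xml.etree.ElementTree. The `sample_feed` below is a hypothetical, well-formed excerpt with real newlines and the same namespace as the feed in the question; the feed as posted (with literal \n sequences and a bare & in one summary) would probably need cleaning before a parser accepts it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample using the same Atom 0.3 namespace.
sample_feed = """<?xml version="1.0" encoding="UTF-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#">
<entry>
<title>first mail</title>
<summary>hello</summary>
</entry>
<entry>
<title>second mail</title>
<summary>world</summary>
</entry>
</feed>"""

# Map a prefix of our choosing to the feed's default namespace.
ns = {"atom": "http://purl.org/atom/ns#"}

# Parse from bytes so the encoding declaration is honoured.
root = ET.fromstring(sample_feed.encode("utf-8"))

# Collect the title text of every <entry> element.
titles = [
    entry.findtext("atom:title", namespaces=ns)
    for entry in root.findall("atom:entry", ns)
]
print(titles)  # ['first mail', 'second mail']
```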
