python regex fails to identify markdown links


Problem description

I am trying to write a regex in Python to find URLs in a Markdown text string. Once a URL is found, I want to check whether it is wrapped in a markdown link: [text](url). I am having trouble with the latter. I am using a regex - link_exp - to search, but the results are not what I expected, and I cannot get my head around it.

This is probably something simple that I am not seeing.

Here is the code, along with an explanation of the link_exp regex:

import re

text = '''
[Vocoder](http://en.wikipedia.org/wiki/Vocoder )
[Turing]( http://en.wikipedia.org/wiki/Alan_Turing)
[Autotune](http://en.wikipedia.org/wiki/Autotune)
http://en.wikipedia.org/wiki/The_Voder
'''

urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)  # find all urls
for url in urls:
    url = re.escape(url)  # rebinds url to its escaped form
    link_exp = re.compile(r'\[.*\]\(\s*{0}\s*\)'.format(url))  # expression with url wrapped in link syntax
    search = re.search(link_exp, text)
    if search is not None:
        print(url)  # prints the escaped url, hence the backslashes in the output below

# the expression should translate to:
# \[ - literal [
# .* - any characters, possibly none
# \] - literal ]
# \( - literal (
# \s* - optional whitespace
# {0} - the url
# \s* - optional whitespace
# \) - literal )
# NOTE: I am including whitespace to cover cases like [foo]( http://www.foo.sexy   )

The output I get is just:

http\:\/\/en\.wikipedia\.org\/wiki\/Vocoder

which means the expression is only finding the link that has whitespace before the closing parenthesis. This is not what I want; links without whitespace should be matched as well, not just that one case.

Do you think you can help me on this one?
cheers

Solution

The problem here is your regex for pulling out the URLs in the first place, which includes ) inside the URLs. This means you end up looking for the closing parenthesis twice. It happens for every link but the first one (the space before the ) saves you there).
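
You can see the swallowed parenthesis directly. A minimal check, reusing the URL pattern from the question as-is:

import re

sample = '[Autotune](http://en.wikipedia.org/wiki/Autotune)'
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
print(re.findall(url_pattern, sample))
# ['http://en.wikipedia.org/wiki/Autotune)']  <- the trailing ) is part of the match,
# so link_exp's final \) has nothing left to match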

I'm not quite sure what each part of your URL regex is trying to do, but the portion that says [$-_@.&+] includes a range from $ (ASCII 36) to _ (ASCII 95), which covers a huge number of characters you probably don't mean to allow, including the ).
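
A quick sketch (not part of the original answer) that prints every character the accidental $-_ range covers:

# all characters in the range $ (ASCII 36) through _ (ASCII 95)
print(''.join(chr(c) for c in range(ord('$'), ord('_') + 1)))
# $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_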

Instead of looking for URLs and then checking to see if they are inside a link, why not do both at once? This way your URL regex can be lazier, because the extra constraints make it less likely to match anything else:

# Anything that isn't a closing square bracket
name_regex = r"[^]]+"
# http:// or https:// followed by anything but a closing paren
url_regex = r"http[s]?://[^)]+"

markup_regex = r'\[({0})\]\(\s*({1})\s*\)'.format(name_regex, url_regex)

for match in re.findall(markup_regex, text):
    print(match)

Result:

('Vocoder', 'http://en.wikipedia.org/wiki/Vocoder ')
('Turing', 'http://en.wikipedia.org/wiki/Alan_Turing')
('Autotune', 'http://en.wikipedia.org/wiki/Autotune')

You could probably improve the URL regex if you need to be stricter.
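
For instance, a slightly stricter variant could require a dotted host name and keep whitespace out of the path; the exact pattern below is my own assumption, not something tested against every markdown corner case:

# hypothetical stricter URL pattern: dotted host, then an optional path
# with no whitespace and no closing paren
url_regex = r'https?://[\w.-]+\.[a-z]{2,}(?:/[^\s)]*)?'

markup_regex = r'\[([^]]+)\]\(\s*({0})\s*\)'.format(url_regex)
print(re.findall(markup_regex, text))
# still matches all three links in the sample text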
