How can I find all Markdown links using regular expressions?


Question

In Markdown there are two ways to place a link: one is to type the raw link, like http://example.com; the other is to use the ()[] syntax: (Stack Overflow)[http://example.com].

I'm trying to write a regular expression that can match both of these and, in the second case, also capture the display string.

So far I have this:

(?P<href>http://(?:www\.)?\S+.com)|(?<=\((.*)\)\[)((?P=href))(?=\])

Debuggex Demo

But this doesn't seem to match either of my two test cases in Debuggex:

http://example.com
(Example)[http://example.com]

I'm really not sure why the first one, at the very least, isn't matched. Is it something to do with my use of the named group? If possible, I'd like to keep using it, because this is a simplified expression for matching the link, and in the real example it is too long for me to feel comfortable duplicating it in two different places in the same pattern.

What am I doing wrong? Or is this not doable at all?

Edit: I'm doing this in Python, so I will be using its regex engine.

Recommended answer

The reason your pattern doesn't work is this part: (?<=\((.*)\)\[). Python's re module doesn't allow variable-length lookbehind, so the whole pattern is rejected when it is compiled, and that is why even the bare URL isn't matched.
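The lookbehind restriction can be checked directly. The following is a minimal sketch that only uses the standard re module and the lookbehind fragment from the question; the variable names are illustrative and not part of the original post.

```python
import re

# The lookbehind from the question: `.*` can match any number of
# characters, so the lookbehind does not have a fixed width.
lookbehind = r"(?<=\((.*)\)\[)"

try:
    re.compile(lookbehind)
    compile_error = None
except re.error as exc:
    # The pattern is rejected at compile time, before any matching happens.
    compile_error = exc

print(compile_error)
```

Since the failure happens at compile time, a pattern containing this fragment cannot match anything, whichever alternative the input would otherwise satisfy.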

You can use the newer regex module for Python to obtain what you want in a more convenient way (the re module has fewer features).

Example: (?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])

Online demo

Pattern details:

(?|                                       # open a branch reset group
    # first case there is only the url
    (?<txt>                               # in this case, the text and the url  
        (?<url>                           # are the same
            (?:ht|f)tps?://\S+(?<=\P{P})
        )
    )
  |                                       # OR
    # the (text)[url] format
    \( ([^)]+) \)                         # this group will be named "txt" too 
    \[ (\g<url>) \]                       # this one "url"
)

This pattern uses the branch reset feature (?|...|...|...), which allows capturing group names (or numbers) to be preserved across an alternation. Since the ?<txt> group is opened first in the first member of the alternation, the first group in the second member automatically gets the same name. The same goes for the ?<url> group.

\g<url> is a reference to the named subpattern ?<url> (like an alias), so there is no need to rewrite it in the second member.

(?<=\P{P}) checks that the last character of the URL is not a punctuation character (useful to avoid matching the closing square bracket, for example). (I'm not sure of the syntax; it may be \P{Punct}.)
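If staying with the standard re module is preferable, the goal of writing the URL sub-pattern only once can be approximated by building the pattern from a Python string. This is a different technique from the branch reset shown above, not a translation of it. The names URL, LINK, raw and find_links are hypothetical, and the URL sub-pattern is a simplified stand-in for the real one.

```python
import re

# Hypothetical, simplified URL sub-pattern, written once and reused below.
# It stops at whitespace or at a closing square bracket.
URL = r"(?:ht|f)tps?://[^\s\]]+"

LINK = re.compile(
    rf"\((?P<txt>[^)]+)\)\[(?P<url>{URL})\]"  # the (text)[url] form from the question
    rf"|(?P<raw>{URL})"                        # a bare URL
)


def find_links(text):
    """Return (display text, url) pairs; a bare URL is its own display text."""
    links = []
    for m in LINK.finditer(text):
        if m.group("raw") is not None:
            links.append((m.group("raw"), m.group("raw")))
        else:
            links.append((m.group("txt"), m.group("url")))
    return links


print(find_links("http://example.com"))
print(find_links("(Example)[http://example.com]"))
```

Because re has no branch reset, the bare URL gets its own group name and the two cases are normalized in Python code instead of inside the pattern. Unlike the (?<=\P{P}) check, this sketch does not trim trailing punctuation from a bare URL.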
