Extracting URL and anchor text from Markdown using Python


Question

I am attempting to extract anchor text and the associated URLs from Markdown. I've seen this question. Unfortunately, the answer there doesn't fully cover what I want.

In Markdown, there are two ways to insert a link:

Example 1:

[anchor text](http://my.url)

Example 2:

[anchor text][1]

   [1]: http://my.url


My script looks like this (note that I am using regex, not re):

import regex  # third-party "regex" module, not the standard-library "re"
body_markdown = "This is an [inline link](http://google.com). This is a [non inline link][1]\r\n\r\n  [1]: http://yahoo.com"

rex = """(?|(?<txt>(?<url>(?:ht|f)tps?://\S+(?<=\P{P})))|\(([^)]+)\)\[(\g<url>)\])"""
pattern = regex.compile(rex)
matches = regex.findall(pattern, body_markdown, overlapped=True)
for m in matches:
    print(m)

This produces the output:

('http://google.com', 'http://google.com')
('http://yahoo.com', 'http://yahoo.com')

My expected output is:

('inline link', 'http://google.com')
('non inline link', 'http://yahoo.com')


How can I properly capture the anchor text from Markdown?

Recommended answer

How can I properly capture the anchor text from Markdown?

Parse it into a structured format (e.g., HTML) and then use the appropriate tools to extract the link labels and addresses.

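import markdown
from lxml import etree

body_markdown = "This is an [inline link](http://google.com). This is a [non inline link][1]\r\n\r\n  [1]: http://yahoo.com"

# Render the Markdown to HTML, then parse it; this input renders to a single <p> root element
doc = etree.fromstring(markdown.markdown(body_markdown))
# Every link, inline or reference-style, becomes an <a> element
for link in doc.xpath('//a'):
    print(link.text, link.get('href'))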

Which gets me:

inline link http://google.com
non inline link http://yahoo.com
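If lxml is not available, the extraction step can probably be done with the standard library alone. The following is a minimal sketch, not part of the original answer: LinkCollector and extract_links are hypothetical names, and sample_html is an assumed stand-in for the HTML that a Markdown renderer would emit for the input above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (anchor text, href) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (text, href) pairs
        self._in_link = False  # True while inside an open <a> element
        self._href = None      # href of the <a> currently open
        self._text = []        # text chunks seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append(("".join(self._text), self._href))
            self._in_link = False


def extract_links(html_text):
    """Return a list of (anchor text, href) tuples found in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Assumed sample: HTML of the shape a Markdown renderer produces for the question's input
sample_html = (
    '<p>This is an <a href="http://google.com">inline link</a>. '
    'This is a <a href="http://yahoo.com">non inline link</a></p>'
)

for text, href in extract_links(sample_html):
    print(text, href)
```

In practice the string passed to extract_links would be the return value of markdown.markdown(body_markdown) rather than a literal. Unlike link.text in the lxml version, this sketch also keeps text that sits inside nested tags such as <em> within the link.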

The alternative is writing your own Markdown parser, which seems like the wrong place to focus your effort.
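To give a feel for what even a small hand-rolled approach has to cover, here is a rough sketch using the standard-library re module. It is an illustration under stated assumptions, not a Markdown parser: it only handles the two link forms from the question, and it ignores nested brackets, link titles, escaped characters, code spans, and everything else a real parser deals with. The names INLINE, REF_USE, REF_DEF, and extract_links are hypothetical.

```python
import re

# Inline form: [text](url), where the URL ends at whitespace or ')'
INLINE = re.compile(r'\[([^\]]+)\]\(\s*([^)\s]+)[^)]*\)')
# Reference use: [text][label]; an empty label falls back to the text itself
REF_USE = re.compile(r'\[([^\]]+)\]\[([^\]]*)\]')
# Reference definition: up to three leading spaces, then [label]: url
REF_DEF = re.compile(r'^[ ]{0,3}\[([^\]]+)\]:\s*(\S+)', re.MULTILINE)


def extract_links(text):
    """Return (anchor text, url) tuples in order of appearance."""
    # Labels are matched case-insensitively, so store them lowercased
    definitions = {label.lower(): url for label, url in REF_DEF.findall(text)}

    found = []  # (position, anchor text, url)
    for m in INLINE.finditer(text):
        found.append((m.start(), m.group(1), m.group(2)))
    for m in REF_USE.finditer(text):
        label = (m.group(2) or m.group(1)).lower()
        if label in definitions:  # skip references that are never defined
            found.append((m.start(), m.group(1), definitions[label]))

    found.sort()
    return [(anchor, url) for _, anchor, url in found]


body_markdown = (
    "This is an [inline link](http://google.com). "
    "This is a [non inline link][1]\r\n\r\n  [1]: http://yahoo.com"
)

for anchor, url in extract_links(body_markdown):
    print(anchor, url)
```

Even this narrow version needs three patterns and a lookup table, which is a fair indication of why rendering to HTML first is usually the easier route.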
