Regular expression to extract URL from an HTML link

Question

I'm a newbie in Python. I'm learning regexes, but I need help here.

Here comes the HTML source:

<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>

I'm trying to code a tool that only prints out http://ptop.se. Can you help me please?

Recommended answer

If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.
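
To see both snippets run end to end, here is a minimal self-contained sketch. The sample string is the link from the question, and the names HREF_RE, first_url, and all_urls are illustrative assumptions rather than part of the original answer.

```python
import re

# Sample input taken from the question; in practice s would be the page source.
s = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'

# Compile once so the same pattern serves both search() and findall().
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')

# First match only: search() returns None when nothing matches.
match = HREF_RE.search(s)
first_url = match.group(1) if match else None

# Every match: with one group in the pattern, findall() returns the group texts.
all_urls = HREF_RE.findall(s)

print(first_url)
print(', '.join(all_urls))
```

Note that the captured value is whatever sits in the href attribute, so for this input it includes the www. prefix.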

Quick explanation of the regexp bits:

r'...' is a "raw" string. It saves you from having to worry about escaping characters quite as much as you normally would. (\ especially: in a raw string a \ is just a \. In a regular string you'd have to write \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match one or more characters that aren't ', ", >, or a space. Essentially this is a list of characters that mark the end of the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

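The pieces described above can be laid out one per line with re.VERBOSE, which lets each part of the pattern carry its own comment. This is only a sketch: the three sample tags and the names HREF_RE, samples, and results are assumptions made up for illustration, chosen to cover double-quoted, single-quoted, and unquoted attribute values.

```python
import re

# Same pattern as in the answer, spread out with comments.
# In verbose mode, whitespace outside a character class is ignored,
# but the space inside [^\'" >] is still part of the class.
HREF_RE = re.compile(r"""
    href=          # the literal attribute name and equals sign
    [\'"]?         # an optional opening quote, single or double
    (              # start of the group that gets returned
      [^\'" >]+    # one or more characters that cannot end the URL
    )              # end of the group
""", re.VERBOSE)

# Hypothetical inputs covering the three quoting styles.
samples = [
    '<a href="http://www.ptop.se" target="_blank">',  # double quotes
    "<a href='http://www.ptop.se'>",                  # single quotes
    '<a href=http://www.ptop.se>',                    # no quotes at all
]

results = [HREF_RE.findall(tag) for tag in samples]
for tag, urls in zip(samples, results):
    print(tag, '->', urls)
```

In all three cases the group stops at the first quote, space, or >, which is why the same pattern copes with each quoting style.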
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's easy enough to do though:

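from bs4 import BeautifulSoup  # BeautifulSoup 4 (the beautifulsoup4 package)

soup = BeautifulSoup(html_to_parse, 'html.parser')
# href=True keeps only the <a> tags that actually have an href attribute
for tag in soup.find_all('a', href=True):
    print(tag['href'])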

Once you've installed BeautifulSoup, anyway.
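
If the external dependency is the sticking point, Python's standard-library html.parser module may be worth a look as a middle ground. The following is a rough sketch of that swapped-in approach, not something the original answer proposed; the class name HrefCollector and the sample html_to_parse value are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper that collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


# Sample input taken from the question.
html_to_parse = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'

collector = HrefCollector()
collector.feed(html_to_parse)
print(collector.hrefs)
```

This trades the one-line regex for a small class, but it only looks at real <a> start tags instead of any text that happens to contain href=.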
