Extracting URL links with regular expressions (re) and string matching in Python

Problem description

I've been trying to extract URLs from a text file using Python's re module: any link that starts with http://, https://, or www.

The file contains plain text as well as HTML source code. The HTML part is easy because I can extract the links using BeautifulSoup, but the plain text seems to be more challenging. I found the code below online, and it seems to be the best implementation of URL extraction, but it fails on certain input: specifically, it can't handle HTML tags and includes them in the URL. Any help is appreciated, because I'm not familiar with string matching at all.

Here is the code:

sp1=re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", str(STRING))
sp2=re.findall('www.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(STRING))

Examples of matches that wrongly include tags:

http://www.website.com/science/</span></a><o:p></o:p></span></div><div
www.website.com/library/</span></a></span></i><span
http://awebsite.com/Groups</a><div>
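
One likely reason the tags end up in the matches: inside a character class, $-_ is read as a range from "$" (0x24) to "_" (0x5F), and that range happens to contain "<", ">", "/" and ":". The sketch below checks this directly; the helper name in_question_class and the sample string are only for illustration.

```python
import re

# The character class from the question. Inside [...], "$-_" is a range
# from "$" (0x24) to "_" (0x5F), not three literal characters.
QUESTION_CLASS = re.compile(r"[$-_@.&+]")


def in_question_class(ch):
    """Return True if the single character ch is accepted by the class."""
    return QUESTION_CLASS.fullmatch(ch) is not None


# Angle brackets, slash and colon fall inside the range; space and the
# double quote do not.
for ch in ["<", ">", "/", ":", " ", '"']:
    print(repr(ch), in_question_class(ch))

# The full pattern from the question, applied to one of the examples.
question_pattern = (
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]"
    r"|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)
sample = "http://awebsite.com/Groups</a><div>"
matches = re.findall(question_pattern, sample)
print(matches)  # the closing tags are part of the match
```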

Recommended answer

re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', str(STRING))

The [^\s<>"]+ part matches any run of characters that are not whitespace, quotes, or angle brackets, so the match stops before the surrounding markup in strings like:

<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>
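
As a quick check, here is a minimal sketch that runs the pattern above on the examples from the question and on the two strings just shown. The names URL_PATTERN and sample_text are made up for this illustration; in the question, the input comes from str(STRING).

```python
import re

# The pattern from the answer, compiled once for reuse.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')

# Sample input assembled from the strings quoted in the question and answer.
sample_text = '''
http://www.website.com/science/</span></a><o:p></o:p></span></div><div
www.website.com/library/</span></a></span></i><span
http://awebsite.com/Groups</a><div>
<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>
'''

# findall returns the full matches because the pattern has no groups.
urls = URL_PATTERN.findall(sample_text)
for url in urls:
    print(url)
```

Each match now ends at the first whitespace, quote, or angle bracket. Note that other trailing characters, such as a full stop at the end of a sentence, would still be kept as part of the URL, so some post-processing may be needed for free-form text.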
