Using regex to extract URLs from plain text with Perl

Question

How can I use Perl regexps to extract all URLs of a specific domain (with possibly variable subdomains) with a specific extension from plain text? I have tried:

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
while ($stuff =~ m/(http\:\/\/.*?homepage.com\/.*?\.gif)/gmsi)
{
    print $1."\n";
}

It fails horribly and gives me:

http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif
http://shomepage.com/woot.gif

I thought that wouldn't happen because I am using .*?, which ought to be non-greedy and give me the smallest match. Can anyone tell me what I am doing wrong? (I don't want some uber-complex, canned regexp to validate URLs; I want to know what I am doing wrong so I can learn from it.)

Recommended answer

URI::Find is specifically designed to solve this problem. It will find all URIs and then you can filter them. It has a few heuristics to handle things like trailing punctuation.
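As a rough sketch of that approach, assuming the URI::Find module from CPAN is installed: the snippet below collects every URI the finder reports, then keeps only those whose host is homepage.com or one of its subdomains and whose path ends in .gif. The helper name find_gif_urls and the exact filter are illustrative choices, not something the module prescribes.

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical helper: return the URLs in $text that live on homepage.com
# (or any subdomain of it) and whose path ends in .gif.
sub find_gif_urls {
    my ($text) = @_;
    my @found;

    my $finder = URI::Find->new(sub {
        my ($uri, $orig_text) = @_;
        # Only URIs that have a host component can be filtered by domain.
        if ($uri->can('host') && defined $uri->host
            && $uri->host =~ /(?:^|\.)homepage\.com\z/i
            && $uri->path =~ /\.gif\z/i) {
            push @found, $orig_text;
        }
        return $orig_text;    # put the matched text back unchanged
    });

    $finder->find(\$text);    # operates on our local copy of the text
    return @found;
}

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
print "$_\n" for find_gif_urls($stuff);
```

Filtering on the parsed host rather than on the raw string means a look-alike such as shomepage.com is rejected, because the host must either be exactly homepage.com or end in .homepage.com.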

UPDATE: Recently updated to handle Unicode.
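
As for why .*? still overshoots in the original attempt: a non-greedy quantifier only keeps the match as short as possible from a given starting point. The engine still begins at the leftmost http:// that can lead to a match, and because . (with /s) matches anything, including spaces, .*? simply keeps expanding from the first http:// until it reaches homepage.com. A minimal sketch of one possible fix, using only core Perl, is to replace . with character classes that cannot cross whitespace or a slash. The helper name extract_gifs is hypothetical, and the pattern assumes that "variable subdomains" means dot-separated labels in front of homepage.com.

```perl
use strict;
use warnings;

# Hypothetical helper: return every http URL on homepage.com (or a subdomain)
# that ends in .gif, using a regex that cannot run across whitespace.
sub extract_gifs {
    my ($text) = @_;
    my @matches = $text =~ m{
        ( http://
          (?: [^\s/.]+ \. )*     # optional subdomain labels, each followed by a dot
          homepage\.com /        # the literal domain, with the dot escaped
          \S*? \.gif             # path without whitespace, up to the first .gif
        )
    }gix;
    return @matches;
}

my $stuff = 'omg http://fail-o-tron.com/bleh omg omg omg omg omg http://homepage.com/woot.gif dfgdfg http://shomepage.com/woot.gif aaa';
print "$_\n" for extract_gifs($stuff);
# prints: http://homepage.com/woot.gif
```

Under this stricter pattern shomepage.com is treated as a different domain and skipped. If it should count, replacing the subdomain group with [^\s/]* would accept any host that merely ends in homepage.com.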
